How the Workflow Engine Works
This page explains what happens under the hood when a Loopstack workflow runs — how state is persisted, how transactions protect against partial failures, and how retries work.
The Core Idea: Durable State Machines
A Loopstack workflow is a state machine backed by a database. Each step (transition) moves the machine from one place to another. The current place and all workflow state are written to PostgreSQL after every step — not just at the end.
This means:
- A workflow that pauses waiting for human input can wait for minutes, hours, or days
- If your server restarts mid-run, the workflow resumes from exactly the last completed step
- Failed transitions don’t corrupt state — each step is atomic
Execution Model
Jobs, Not Threads
Transitions don’t run in a long-lived thread. Each processing pass is a BullMQ job:
- A job is enqueued when a workflow is started or when a user triggers a
wait: truetransition - The BullMQ worker inside your NestJS app picks up the job
- The engine loads the latest checkpoint from PostgreSQL, then runs as many auto-transitions as it can in a loop
- The job ends when it hits a
wait: truetransition orend— or when it encounters an error
The loop looks like this conceptually:
load state from latest checkpoint
while (there are auto-transitions available):
run the next transition
save checkpoint to PostgreSQL
if error: handle and stop
end when: wait | end | errorThis is why workflows survive server restarts: the job ends and the state is safely in PostgreSQL. When the next job runs (triggered by a user action or retry), it picks up exactly from where the previous job left off.
Transactions and Rollback
Each transition runs inside a database transaction:
start transaction
→ call your transition method
→ save updated state to workflow entity
→ write a new checkpoint row
commit (or rollback on error)If your transition method throws an error:
- The database transaction is rolled back — neither the new state nor the checkpoint is written
- The in-memory document cache is restored to its pre-transition snapshot — documents saved during the failed transition are discarded
- The error is recorded on the workflow entity
- The workflow stays at its current place — as if the transition never ran
This is what “automatic rollback” means in practice: a partially-executed transition leaves no trace. The workflow stays exactly where it was before the attempt.
ErrorDocument
After the rollback completes, the engine writes a single ErrorDocument to the run with the thrown error’s message. This is the only side-effect of a failed transition that survives the rollback — it lives outside the failed transaction so the audit trail is preserved. Subsequent retry attempts each add their own ErrorDocument, giving a full per-attempt history.
ErrorDocument is a built-in document type (@loopstack/common) that workflow authors don’t construct manually — it’s an engine-managed artifact. See Error Handling for the retry/recovery patterns that consume it.
Checkpoints
After every successful transition, the engine writes a checkpoint row to PostgreSQL. A checkpoint contains:
- The current place (e.g.
waiting_for_approval) - A snapshot of the state object at that point in time
- The document IDs visible at that transition
- A version number that increments with each step
When the next job runs, it loads the latest checkpoint and restores state from it. This is the mechanism that makes workflows resumable across restarts.
Retries
When a transition fails, the engine checks the retry configuration on that transition. By default, workflows don’t retry automatically — they stay at the failed place for manual retry via the Studio UI.
To configure automatic retries:
@Transition({
from: 'calling_api',
to: 'api_complete',
retry: {
attempts: 3,
delay: 1000, // ms before first retry
backoff: 'exponential',
maxDelay: 30000, // cap at 30 seconds
},
})
async callExternalApi(state: MyState): Promise<MyState> {
// ...
}How exponential backoff is calculated:
attempt 1: delay × 2^0 = 1000ms
attempt 2: delay × 2^1 = 2000ms
attempt 3: delay × 2^2 = 4000ms (capped at maxDelay)The delay is implemented as a BullMQ job delay — the job is re-queued with a scheduled execution time. Nothing blocks during the wait.
After retries are exhausted:
If all retry attempts fail, you can route the workflow to a custom error-handling place:
retry: {
attempts: 3,
place: 'error_handling', // go here instead of staying at current place
}Without place, the workflow stays at the current place with hasError: true. The Studio toolbar shows a Retry button that re-triggers the failed transition immediately.
Wait Transitions
wait: true transitions are fundamentally different from auto-transitions. They don’t run automatically — they pause the workflow and wait for an external trigger.
When the engine encounters a place where the only available transition has wait: true:
- The job ends — the workflow is saved with
status: waiting - Nothing more runs until an external trigger arrives
- The trigger can come from: a Studio button click, a sub-workflow callback, or an API call
When the trigger arrives, a new BullMQ job is enqueued with the transition payload. The job loads the checkpoint, executes the wait transition (passing the payload to the method), and then continues the auto-transition loop from the new place.
This is why a paused workflow survives restarts: the trigger doesn’t need to arrive while the original job is running. It can arrive seconds or weeks later — the checkpoint is in PostgreSQL waiting for it.
Timeouts
Each transition has a default timeout of 5 minutes (DEFAULT_TRANSITION_TIMEOUT=300000). If a transition doesn’t complete within the timeout, it’s treated as a failure and the retry logic applies.
Override the default for specific transitions:
@Transition({
from: 'processing',
to: 'done',
timeout: 60000, // 1 minute for this transition
})
async processData(state: MyState): Promise<MyState> {
// ...
}Set timeout: 0 to disable the timeout entirely for a specific transition.
Workflow Lifecycle States
Each workflow run has a status from the WorkflowState enum, persisted on the workflow entity. The engine sets it as the run progresses.
| State | Description | Set by |
|---|---|---|
pending | Run created, not yet picked up by a worker (or re-queued after a wait/error). | Initial value when a run is created, and again when a wait transition is resumed. |
running | A worker is currently executing a transition for this run. | Root processor at the start of each job. |
waiting | Run is paused — either waiting for an external trigger (wait transition) or waiting for a retry/sub-workflow. | Processor when a job ends without reaching end and without an error that fails the run. |
completed | Final transition reached to: 'end'. The result field holds the return value. | Processor when a transition moves to end. |
failed | A transition errored and retries are exhausted (or the error has no retry). | Processor on terminal failure. |
canceled | Cancellation was requested — recursively applied to all child runs. Parent callback fires with canceled status. | WorkflowOrchestrationService.cancel(). |
paused | Reserved — currently not set by the engine. Treated like waiting in queries (e.g. dashboard “action required”). | — |
Terminal states are completed, failed, and canceled — no further work is scheduled for them, and any parent sub-workflow callback fires once the terminal state is reached.
Stateless Execution
A workflow can run in stateless mode via WorkflowRunner.runSync({ stateless: true, … }). The engine skips all persistence and runs the entire state machine in a single pass:
- No database transactions
- No checkpoints
- Documents created via
documentStore.save()live in memory for the duration of the run and are returned in the result — they are not written to the database - All transitions run synchronously in one pass; the call returns when the workflow reaches
end(or errors)
Use stateless execution when:
- Testing workflows — assert against the returned result and in-memory documents without spinning up Postgres/Redis (see
createStatelessContextin@loopstack/testing). - One-off computations — synchronous request/response flows where you want the workflow logic but don’t need durability or resumability (e.g. an internal API endpoint that computes something via a workflow and returns immediately).
- Dry-runs / previews — verify a workflow’s output for a given input without leaving a record.
Stateless runs cannot pause: wait: true transitions and sub-workflow queue() calls aren’t useful here, because there’s no run to resume. Use the regular durable path for those.
See Programmatic Execution for the runSync() API.
Next Steps
- Error Handling — retry configs, custom error places, and the Studio retry button
- Architecture Overview — how the engine fits into the broader system
- Why Documents Exist — the document store and its relationship to workflow state