Guide: deploymill://guides/rollback

Rollback reference (deploymill)

deploymill supports image-swap rollback: after a bad deploy, swap the running container back to the image of a previous deploy in seconds — no rebuild required.

How it works

When "rollback": true is set in .deploymill/project.json and reconcile_project has been run:

1. Reconcile wires up a container registry on the app (using credentials configured on the server).
2. It enables rollback recording on the application.
3. The next deploy builds the image as normal, then pushes it to the registry and creates a rollback record (with a rollbackId).
4. list_deployments exposes each deploy's rollbackId.
5. rollback swaps the running image to whichever rollbackId you pass: the platform pulls the image from the registry and restarts the container.

Enabling rollback

1. Set "rollback": true in .deploymill/project.json.
2. Commit and push.
3. Run reconcile_project with the app's applicationId and repoUrl. Reconcile flips the rollback toggle and configures the container registry.
4. Run deploy. This is the first deploy whose image will be available to roll back to. Deploys made before enabling rollback have no captured image.
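
As a minimal sketch of the first step, the helper below sets the rollback key in the text of .deploymill/project.json. It assumes only that the file is a single JSON object; the helper name with_rollback is hypothetical and not part of deploymill, and any other keys in the file are passed through untouched.

```python
import json


def with_rollback(config_text, mode):
    """Return project.json text with "rollback" set to mode.

    Hypothetical helper, not a deploymill API. The guide documents three
    values for the key: true, false, and "auto".
    """
    # bool is checked by type so that 1 or 0 are not silently accepted.
    if not (isinstance(mode, bool) or mode == "auto"):
        raise ValueError('rollback must be true, false, or "auto"')
    config = json.loads(config_text)
    if not isinstance(config, dict):
        raise ValueError("project.json is assumed to hold a JSON object")
    config["rollback"] = mode
    return json.dumps(config, indent=2) + "\n"


if __name__ == "__main__":
    # Starting from an empty object purely for illustration.
    print(with_rollback("{}", True), end="")
```

After writing the result back to the file, the remaining steps (commit, push, reconcile_project, deploy) are unchanged.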

Prerequisite: the container registry must be configured on the deploymill server. If it is not, reconcile fails with a clear error.

Performing a rollback

1. list_deployments({ applicationId })       → find the deploy you want to revert to; copy its rollbackId
2. rollback({ applicationId, rollbackId })   → image swap completes in seconds

The swap is non-destructive to the registry: the current image stays available, so you can roll forward again by rolling back to the deploy you just rolled away from.
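
The lookup in step 1 can be sketched as follows. The response shape is an assumption: this guide only says that list_deployments exposes each deploy's rollbackId, so the sketch assumes a list of dicts ordered newest first, with the rollbackId key absent on deploys made before rollback was enabled. The function name pick_rollback_id is hypothetical.

```python
from typing import Optional


def pick_rollback_id(deployments: list, skip: int = 1) -> Optional[str]:
    """Return the rollbackId of the newest usable deploy after skipping `skip`.

    Assumes `deployments` is ordered newest first, so skip=1 steps over the
    deploy that is currently live (the one being rolled away from).
    """
    for deploy in deployments[skip:]:
        rollback_id = deploy.get("rollbackId")
        if rollback_id:  # deploys without a captured image are passed over
            return rollback_id
    return None


if __name__ == "__main__":
    # Illustrative data only; real IDs come from list_deployments.
    history = [{"rollbackId": "rb_3"}, {"rollbackId": "rb_2"}, {}]
    print(pick_rollback_id(history))
```

A None result means there is no earlier captured image, which matches the case where the current deploy is the first one made after enabling rollback.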

What rollback does NOT do

Doesn't roll back database migrations. If the deploy you're reverting from added a column or table, the image you're rolling back to will likely throw at startup. Treat rollback as a tool for code-only changes; schema changes need a forward fix or a manual downgrade (alembic downgrade, node-pg-migrate down).
Doesn't roll back env vars. set_env_vars is not history-tracked. If a bad deploy went out with a wrong env var, fix the env var first (set_env_vars again) — otherwise the rolled-back code will hit the same bad value.
Doesn't roll back data writes. Reverting code that corrupted data only restores the code — the data stays corrupted.
Doesn't roll back mounts or domain changes. Those are app-level config, not per-deploy.

When to use rollback

✅ A code change broke a route or threw an exception at startup. Roll back, fix in a PR, redeploy.
✅ A perf regression you can't immediately diagnose. Roll back to buy time.
✅ A misconfigured runtime behavior (wrong feature flag default, wrong static file).
⚠️ A schema migration broke things. Rollback alone may leave the schema ahead of the code. Usually you need a forward fix + redeploy, or a manual downgrade first.
❌ Data corruption. Restore from backup, not from a rollback record.

Automatic rollback (self-healing)

Set "rollback": "auto" (instead of true) in .deploymill/project.json, commit, reconcile. This enables rollback recording exactly like true, and arms post-deploy self-healing: when a deploy builds and swaps a new image but the health gate comes back unhealthy, deploy automatically reverts to a known-good image — no second tool call from you.

The health gate is the trigger. Auto-rollback keys off the same health-endpoint contract that deploy, rollback, and get_app_health use: a deploy whose health endpoint (default /healthz) doesn't return 200 within the configured number of consecutive attempts (retries) is unhealthy. See the health guide (deploymill://guides/health) for the contract, the health config block, and the strict-vs-lenient / 404-fallback rules. Put your real readiness checks in /healthz so "healthy" means what you need it to mean.
You don't pass anything to deploy; the intent is persisted by reconcile_project and deploy acts on it. reconcile's plan.rollback.auto shows whether it's armed.
Reverts to the last healthy deploy, not just the previous one. deploy records each deploy's health verdict; auto-rollback walks back to the most recent earlier deploy that was recorded healthy (falling back to the most recent earlier image when nothing has a recorded result). So stacking two broken deploys won't land you on the second-broken one.
deploy's response carries an autoRollback object and an autoRollbackNote:
- { attempted: true, toLastHealthy: true, recovered: true, … } → reverted to the last recorded-healthy image and it recovered. Fix forward in a PR.
- { attempted: true, recovered: false, … } → reverted but it's still unhealthy (degraded). Something broader is wrong (DB, dependency, or the target image is also bad) — investigate with get_logs / get_app_health.
- { attempted: false, reason: "no_rollback_point" } → there was no earlier captured image to revert to (the first deploy after enabling rollback). The bad image is still live; fix forward.
Auto-rollback triggers only when the new image is live but all edges fail the health gate (a single-domain blip won't trip it), and it reverts at most once (no flapping). Workers have no domains, so auto-rollback never fires for them.
Orchestrator-level gate (recommended). Declaring a health block also wires the endpoint into the container's Swarm HEALTHCHECK with a start-first / failure-action=rollback rollout. The platform then won't cut over to, or complete the rollout on, an unhealthy new task, and it self-heals at the orchestrator layer before deploymill's own probe even runs. See the health guide.
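
The "last healthy deploy" selection described above can be sketched as below. This is an illustration of the stated rule, not deploymill's implementation: the record fields rollbackId and healthy (True, False, or None for no recorded verdict) and the function name are assumptions. The guide does not say what happens when some earlier deploys are recorded unhealthy and none are recorded healthy, so the sketch returns None in that case as a guess.

```python
from typing import Optional


def pick_auto_rollback_target(earlier: list) -> Optional[dict]:
    """Choose the deploy to revert to, given deploys before the failing one.

    `earlier` is assumed to be ordered newest first. Only deploys with a
    captured image (a rollbackId) are candidates.
    """
    candidates = [d for d in earlier if d.get("rollbackId")]

    # Walk back to the most recent earlier deploy that was recorded healthy.
    for deploy in candidates:
        if deploy.get("healthy") is True:
            return deploy

    # Fallback from the guide: when nothing has a recorded result, use the
    # most recent earlier image.
    if candidates and all(d.get("healthy") is None for d in candidates):
        return candidates[0]

    # No usable target. For an empty history this corresponds to the
    # reason "no_rollback_point"; for the mixed case it is an assumption.
    return None


if __name__ == "__main__":
    # Two stacked broken deploys: the second-newest is skipped.
    history = [
        {"rollbackId": "rb_3", "healthy": False},
        {"rollbackId": "rb_2", "healthy": True},
    ]
    print(pick_auto_rollback_target(history)["rollbackId"])
```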

Disabling rollback

Set "rollback": false in .deploymill/project.json, commit, reconcile. The next deploy won't push to the registry; existing rollback records remain queryable but new ones won't accumulate. Disabling also disarms auto-rollback.

What NOT to do

Don't roll back past a migration. Check what shipped in the gap; if a migration ran, forward-fix instead.
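
One way to check the gap is to list the files that changed between the target deploy's commit and the current one (for example with git diff --name-only) and look for migration files. The sketch below does only the filtering step. The directory names are assumptions based on common alembic and node-pg-migrate layouts and should be adjusted to the project; the function name is hypothetical.

```python
def migrations_in_gap(changed_paths, migration_dirs=("migrations/", "alembic/versions/")):
    """Return the changed paths that sit inside a migration directory.

    `migration_dirs` holds assumed directory names, each ending in "/".
    A path matches when the directory appears at the repo root or nested
    anywhere below it.
    """
    hits = []
    for path in changed_paths:
        rooted = "/" + path.lstrip("/")
        # Prefixing with "/" makes root-level and nested dirs match alike,
        # while a file merely named like the directory does not.
        if any("/" + d in rooted for d in migration_dirs):
            hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # Illustrative paths only.
    changed = ["src/app.py", "migrations/002_add_col.js"]
    found = migrations_in_gap(changed)
    print("forward-fix instead" if found else "no migrations in the gap")
```

An empty result only shows that no files changed under the listed directories; migrations applied by other means would not be caught.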

Troubleshooting

list_deployments shows no rollbackId on recent deploys → rollback was never enabled, OR the deploy ran before reconcile turned rollback on. Confirm with a dryRun: true reconcile to see the current toggle state.
rollback fails with an auth error → registry credentials expired or were rotated on the server. The operator needs to refresh them.
Rolled-back container won't start → likely a schema mismatch. The image is older than the current DB schema. Forward-fix or manual downgrade.