Health endpoint & deploy/rollback health gate (deploymill)
deploymill keys every "is this deploy good?" decision off one app-owned health endpoint. deploy, rollback, get_app_health, and auto-rollback all probe the same path and apply the same rule:
200 means everything's good. Anything else — a non-200 status, a connection refused, or a timeout — means the deploy is bad → roll back.
There is one function and one rule. Put your app's real readiness checks in that handler and return 200 only when they all pass.
The contract
- Every web app exposes one health endpoint. The default path is /healthz (the starter templates already serve a real 200 there).
- The handler should assert whatever "ready" means for your app (DB reachable, migrations ran, a required file exists, an upstream API answers) and return 200 only when all of them pass. Return a non-2xx (e.g. 503) when any check fails.
- An agent debugging a runtime issue can extend this handler to assert whatever it needs; the gate then enforces it on every deploy.
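The "200 only when every check passes" contract can be sketched independently of any web framework. This is a minimal illustration, not deploymill code: the function name `readiness_response` and the check names are hypothetical, and a check that raises is assumed to count as a failed check.

```python
from typing import Callable, Dict, Tuple


def readiness_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every readiness check; return 200 only if all of them pass."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # Assumption: a check that raises is a failed check, not a crash.
            results[name] = False
    ok = all(results.values())
    # 503 is the non-2xx status the contract suggests for "not ready".
    return (200 if ok else 503), {"ok": ok, "checks": results}


# Hypothetical usage: one passing check and one failing check yield 503.
status, body = readiness_response({
    "db": lambda: True,
    "required_file": lambda: False,
})
```

A handler in any framework can call a helper like this and return `status` with `body`, so adding a new readiness assertion is a one-line change to the check table.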
Probe semantics
The probe HEADs the resolved health path on each attached domain.
- Strict mode (a real health path like /healthz): 200 = healthy. Any other status, a connection error, or a timeout = unhealthy. A 500 or 503 from the app counts as a failure (unlike the legacy gateway-only rule).
- Lenient mode (the bare root /, the opt-out): only 502/503/504 or "can't connect" count as unhealthy; any 2xx–4xx proves the edge routes. This is the pre-DET-120 behavior.
- Fail N times → unhealthy (N is the retries setting). The probe does not short-circuit on the first healthy response (during a rolling update the old container may still be answering). It requires N consecutive healthy probes to confirm health and declares failure only after N consecutive failures (including timeouts), spaced by intervalMs. This is the "if it fails n times it rolls back" behavior.
- 404 fallback. If a strict health path returns 404 (the endpoint simply isn't there, e.g. an older app that never added /healthz), the probe falls back to a lenient root / probe instead of treating the missing endpoint as a hard failure. Add the /healthz handler to get the strict gate.
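The rules above can be condensed into a small sketch. It is an illustration under stated assumptions, not the actual probe: all function names are hypothetical, `None` stands in for "connection error or timeout", a timeout in lenient mode is assumed to count like "can't connect", and what the real probe does when neither streak is reached is not specified here, so the sketch reports "undecided".

```python
from typing import Iterable, Optional

GATEWAY_ERRORS = {502, 503, 504}


def is_healthy(status: Optional[int], strict: bool) -> bool:
    """Classify one probe result. status=None means connection error or timeout."""
    if status is None:
        return False
    if strict:
        return status == 200
    # Lenient: only gateway errors are unhealthy. By that "only" wording a 500
    # is not counted as unhealthy here (assumption).
    return status not in GATEWAY_ERRORS


def needs_lenient_fallback(status: Optional[int], path: str) -> bool:
    """A 404 on a strict path means the endpoint is missing: re-probe "/" leniently."""
    return path != "/" and status == 404


def verdict(results: Iterable[bool], n: int) -> str:
    """Return a verdict once n consecutive probes agree, else "undecided"."""
    streak_value, streak = None, 0
    for healthy in results:
        if healthy == streak_value:
            streak += 1
        else:
            streak_value, streak = healthy, 1
        if streak >= n:
            return "healthy" if healthy else "unhealthy"
    return "undecided"
```

With `n = 3`, the sequence unhealthy, healthy, healthy, healthy is confirmed healthy only on the fourth probe, which is why a single early 200 from a draining container does not settle the outcome.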
Declarative config
Add a health block to .deploymill/project.json:
{
  "health": { "path": "/healthz", "retries": 3, "intervalMs": 3000, "timeoutMs": 5000 }
}
| field | default | meaning |
|---|---|---|
| path | /healthz | Path to probe. Set to "/" to opt out of strict mode (lenient root probe). |
| retries | 3 | Consecutive healthy probes to confirm health / consecutive failures to declare dead. |
| intervalMs | 3000 | Spacing between probe attempts. |
| timeoutMs | 5000 | Per-attempt request timeout. A timeout counts as a failure. |
reconcile_project mirrors the resolved block into the app's metadata so deploy (a primitive that never reads project.json) can read it. reconcile's plan.health reports { current, desired, action, orchestratorGate }.
Defaults / back-compat. Omitting the block keeps existing apps working: the probe still defaults to /healthz but falls back to lenient / on a 404, and no orchestrator Swarm HEALTHCHECK is wired (no platform-layer behavior change for apps that didn't ask for it). New web apps are scaffolded with the block declared.
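As a rough sketch of how the defaults and the back-compat rule fit together, the resolution step might look like the following. The function name `resolve_health` and the `strict` and `declared` keys are hypothetical and only illustrate the documented defaults; the real reconcile_project logic is not shown in this document.

```python
from typing import Any, Dict

# Defaults taken from the table above.
HEALTH_DEFAULTS = {"path": "/healthz", "retries": 3, "intervalMs": 3000, "timeoutMs": 5000}


def resolve_health(project: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a project.json-style dict with the documented defaults."""
    declared = isinstance(project.get("health"), dict)
    block = project["health"] if declared else {}
    resolved = {key: block.get(key, default) for key, default in HEALTH_DEFAULTS.items()}
    # "/" is the opt-out: the root path is probed leniently.
    resolved["strict"] = resolved["path"] != "/"
    # Only a declared block opts the app into the orchestrator-level gate.
    resolved["declared"] = declared
    return resolved
```

For example, `resolve_health({})` yields the default /healthz settings with `declared` set to False, while `resolve_health({"health": {"path": "/"}})` keeps the default timings but turns strict mode off.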
Orchestrator-level gate (declaring health opts in)
When a web app declares a health block, reconcile_project also wires the same endpoint into the container's Swarm HEALTHCHECK, with a start-first / FailureAction: rollback update policy. That means the orchestrator:
- won't shift traffic to the new task until it passes the health check, and
- auto-rolls-back the service update if the new task never goes healthy within the monitor window.
Our post-deploy probe then becomes confirmation, not the sole gate — and the rolling-update drain race (a stale 200 from the draining old container) is closed at the platform layer.
Notes:
- Only applies when the Dokploy instance runs in Swarm mode; otherwise the setting is stored and ignored.
- The in-container probe command is stack-specific (busybox wget on the Node/alpine images, python on the Python image). For an undetectable stack (or reconcile called with a raw config object instead of repoUrl), the orchestrator gate is skipped with a warning and the post-deploy probe gate still applies, because wiring a HEALTHCHECK that can't run inside the image would brick the rollout.
- Wiring is best-effort: a Dokploy rejection is surfaced as a warning and never fails the reconcile.
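To make the orchestrator wiring concrete, here is a hedged sketch of the kind of settings involved, using Docker Engine API field names (HealthCheck durations are in nanoseconds). The stack labels, the `port` argument, the exact probe commands, and the reuse of retries/intervalMs/timeoutMs for the HEALTHCHECK are all assumptions made for illustration; the commands deploymill actually generates and the payload it sends to Dokploy are not shown in this document.

```python
from typing import Any, Dict, Optional

NS_PER_MS = 1_000_000  # Docker's API expresses health-check durations in nanoseconds


def probe_command(stack: str, url: str) -> Optional[str]:
    """Pick an in-container probe command; None for a stack we cannot detect."""
    if stack == "node-alpine":  # hypothetical stack label
        return f"wget -q -O /dev/null {url} || exit 1"
    if stack == "python":  # hypothetical stack label
        return f"python -c \"import urllib.request; urllib.request.urlopen('{url}')\" || exit 1"
    return None


def swarm_health_settings(health: Dict[str, Any], stack: str, port: int) -> Optional[Dict[str, Any]]:
    """Build HEALTHCHECK and update-policy settings, or None to skip the gate."""
    command = probe_command(stack, f"http://127.0.0.1:{port}{health['path']}")
    if command is None:
        # Undetectable stack: skip the orchestrator gate; the post-deploy probe still applies.
        return None
    return {
        "HealthCheck": {
            "Test": ["CMD-SHELL", command],
            "Interval": health["intervalMs"] * NS_PER_MS,
            "Timeout": health["timeoutMs"] * NS_PER_MS,
            "Retries": health["retries"],
        },
        # start-first: the new task must be healthy before the old one is stopped.
        "UpdateConfig": {"Order": "start-first", "FailureAction": "rollback"},
    }
```

Returning `None` for an unknown stack mirrors the note above: a command that cannot run inside the image would make every new task look unhealthy, so skipping the gate is the safer default.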
Auto-rollback keys off this gate
With "rollback": "auto", a deploy whose health endpoint doesn't return 200 within retries attempts is automatically reverted — and it reverts to the most recent earlier deploy that was recorded healthy (not merely the previous image, which may also be broken). See the rollback guide for the full auto-rollback flow. deploy records each deploy's health verdict so this "last healthy" target selection works.
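The "last healthy" target selection can be sketched as a scan backwards through recorded deploys. This is an illustration only: the `DeployRecord` shape and the function name are hypothetical, and the history is assumed to be ordered oldest to newest.

```python
from typing import List, Optional, TypedDict


class DeployRecord(TypedDict):
    id: str
    healthy: bool  # the health verdict recorded at deploy time


def pick_rollback_target(history: List[DeployRecord], failed_id: str) -> Optional[DeployRecord]:
    """Return the most recent deploy before failed_id that was recorded healthy."""
    ids = [record["id"] for record in history]
    if failed_id not in ids:
        return None
    earlier = history[: ids.index(failed_id)]
    for record in reversed(earlier):
        if record["healthy"]:
            return record
    return None  # no earlier healthy deploy is on record
```

If deploys d1 (healthy), d2 (unhealthy), and d3 (unhealthy) are recorded in that order, a failed d3 reverts to d1 rather than to d2, the merely previous image.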
Writing a real health handler
Node (Hono):
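// Assumes `app` is your Hono instance and `pool` is your database pool.
app.get("/healthz", async (c) => {
  try {
    // Confirms the DB is reachable; add a migrations check here if readiness depends on it.
    await pool.query("SELECT 1");
    return c.json({ ok: true }); // 200
  } catch {
    return c.json({ ok: false }, 503); // not ready → deploy stays on the old image
  }
});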
Python (FastAPI):
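from fastapi import Response

# Assumes `app` is your FastAPI instance and `engine` is your SQLAlchemy engine.
# The probe sends HEAD, and a plain @app.get route in FastAPI does not answer HEAD,
# so register both methods explicitly.
@app.api_route("/healthz", methods=["GET", "HEAD"])
def healthz(response: Response) -> dict:
    try:
        engine.connect().close()  # DB reachable?
        return {"ok": True}
    except Exception:
        response.status_code = 503  # not ready
        return {"ok": False}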
What NOT to do
- Don't make /healthz heavy. It's hit repeatedly on every deploy and by the orchestrator HEALTHCHECK. Keep checks fast (a SELECT 1, a file stat), not a full integration test.
- Don't return 200 unconditionally if you care about readiness. A handler that always returns 200 defeats the gate: the deploy will look healthy even when the DB is down.
- Don't put liveness-only logic here and expect readiness gating. One endpoint, readiness semantics: 200 iff the app can actually serve.
- Don't probe a path that needs auth or a body. The probe is an unauthenticated HEAD. Keep /healthz open and HEAD-able.