Guide: deploymill://guides/health

Health endpoint & deploy/rollback health gate (deploymill)

deploymill keys every "is this deploy good?" decision off one app-owned health endpoint. deploy, rollback, get_app_health, and auto-rollback all probe the same path and apply the same rule:

200 means everything's good. Anything else — a non-200 status, a connection refused, or a timeout — means the deploy is bad → roll back.

There is one function and one rule. Put your app's real readiness checks in that handler and return 200 only when they all pass.

The contract

Every web app exposes one health endpoint. The default path is /healthz (the starter templates already serve a real 200 there).
The handler should assert whatever "ready" means for your app (DB reachable, migrations ran, a required file exists, an upstream API answers) and return 200 only when all of them pass. Return a non-200 status (e.g. 503) when any check fails; under the strict rule even another 2xx such as 204 counts as unhealthy.
An agent debugging a runtime issue can extend this handler to assert whatever it needs; the gate then enforces it on every deploy.

Probe semantics

The probe HEADs the resolved health path on each attached domain.

Strict mode (a real health path like /healthz): 200 = healthy. Any other status, a connection error, or a timeout = unhealthy. A 500 or 503 from the app counts as a failure (unlike the legacy gateway-only rule).
Lenient mode (the bare root /, the opt-out): only 502/503/504 or "can't connect" count as unhealthy — any 2xx–4xx proves the edge routes. This is the pre-DET-120 behavior.
Fail N times → unhealthy, where N is the configured retries value. The probe does not short-circuit on the first healthy response (during a rolling update the old container may still be answering). It requires N consecutive healthy probes to confirm health and declares failure only after N consecutive failures (including timeouts), with attempts spaced intervalMs apart. This is the "if it fails n times it rolls back" behavior.
404 fallback. If a strict health path returns 404 (the endpoint simply isn't there — e.g. an older app that never added /healthz), the probe falls back to a lenient root / probe instead of treating the missing endpoint as a hard failure. Add the /healthz handler to get the strict gate.
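
The rules above can be sketched as a small classifier plus a streak counter. This is a minimal illustration, not deploymill's actual implementation: the names is_healthy, probe, attempt, and max_attempts are hypothetical, and the sketch assumes that a timeout and a connection error are both reported as None and that the 404 fallback switches to the root path for all remaining attempts.

```python
from typing import Callable, Optional

# One probe attempt returns an HTTP status code, or None for a connection
# error or a timeout. (Hypothetical shape, chosen for this sketch.)
Attempt = Callable[[str], Optional[int]]


def is_healthy(status: Optional[int], path: str) -> bool:
    """Classify a single attempt under the strict or lenient rule."""
    if status is None:  # could not connect, or timed out
        return False
    if path == "/":  # lenient mode: only gateway errors count as failures
        return status not in (502, 503, 504)
    return status == 200  # strict mode: exactly 200


def probe(
    attempt: Attempt,
    path: str,
    retries: int,
    wait: Callable[[], None] = lambda: None,  # e.g. sleep(intervalMs / 1000)
    max_attempts: int = 50,  # assumption: bound the loop if results flap
) -> bool:
    """True after `retries` consecutive healthy attempts,
    False after `retries` consecutive failures."""
    healthy_streak = failure_streak = 0
    for _ in range(max_attempts):
        status = attempt(path)
        if path != "/" and status == 404:
            # Strict endpoint is missing: fall back to the lenient root probe.
            path = "/"
            status = attempt(path)
        if is_healthy(status, path):
            healthy_streak, failure_streak = healthy_streak + 1, 0
        else:
            healthy_streak, failure_streak = 0, failure_streak + 1
        if healthy_streak >= retries:
            return True
        if failure_streak >= retries:
            return False
        wait()
    return False  # never settled within max_attempts
```

Note that a single stale 200 from a draining container cannot pass this gate on its own, because any failure resets the healthy streak to zero.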

Declarative config

Add a health block to .deploymill/project.json:

{
  "health": { "path": "/healthz", "retries": 3, "intervalMs": 3000, "timeoutMs": 5000 }
}

| field | default | meaning |
| --- | --- | --- |
| `path` | `/healthz` | Path to probe. Set to `"/"` to opt out of strict mode (lenient root probe). |
| `retries` | `3` | Consecutive healthy probes to confirm health / consecutive failures to declare dead. |
| `intervalMs` | `3000` | Spacing between probe attempts. |
| `timeoutMs` | `5000` | Per-attempt request timeout. A timeout counts as a failure. |
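
As a rough illustration of how these defaults combine with a declared block, the sketch below merges the two and estimates the longest time the probe could take to declare failure. The function names and the extra mode and declared keys are hypothetical. The timing estimate assumes intervalMs is a pause between the end of one attempt and the start of the next, which this guide does not specify.

```python
import json

# Defaults taken from the table above.
DEFAULTS = {"path": "/healthz", "retries": 3, "intervalMs": 3000, "timeoutMs": 5000}


def resolve_health(project_json: str) -> dict:
    """Merge a project's declared health block over the defaults."""
    project = json.loads(project_json)
    declared = project.get("health") or {}
    resolved = {**DEFAULTS, **declared}
    # Illustrative extras, not real deploymill fields:
    resolved["mode"] = "lenient" if resolved["path"] == "/" else "strict"
    resolved["declared"] = "health" in project
    return resolved


def worst_case_failure_ms(health: dict) -> int:
    """Upper bound on time to declare failure if every attempt times out.

    Assumes `retries` attempts of up to timeoutMs each, separated by
    (retries - 1) pauses of intervalMs.
    """
    n = health["retries"]
    return n * health["timeoutMs"] + (n - 1) * health["intervalMs"]


if __name__ == "__main__":
    health = resolve_health('{"health": {"retries": 5}}')
    print(health["path"], health["mode"], worst_case_failure_ms(health))
```

With the defaults, the estimate is 3 × 5000 + 2 × 3000 = 21000 ms under that assumption.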

reconcile_project mirrors the resolved block into the app's metadata so deploy (a primitive that never reads project.json) can read it. reconcile's plan.health reports { current, desired, action, orchestratorGate }.

Defaults / back-compat. Omitting the block keeps existing apps working: the probe still defaults to /healthz but falls back to lenient / on a 404, and no orchestrator Swarm HEALTHCHECK is wired (no platform-layer behavior change for apps that didn't ask for it). New web apps are scaffolded with the block declared.

Orchestrator-level gate (declaring `health` opts in)

When a web app declares a health block, reconcile_project also wires the same endpoint into the container's Swarm HEALTHCHECK, with a start-first / FailureAction: rollback update policy. That means the orchestrator:

won't shift traffic to the new task until it passes the health check, and
auto-rolls-back the service update if the new task never goes healthy within the monitor window.

Our post-deploy probe then becomes confirmation, not the sole gate — and the rolling-update drain race (a stale 200 from the draining old container) is closed at the platform layer.

Notes:

Only applies when the Dokploy instance runs in Swarm mode; otherwise the setting is stored and ignored.
The in-container probe command is stack-specific (busybox wget on the Node/alpine images, python on the Python image). For an undetectable stack (or reconcile called with a raw config object instead of repoUrl), the orchestrator gate is skipped with a warning and the post-deploy probe gate still applies — wiring a HEALTHCHECK that can't run inside the image would brick the rollout.
Wiring is best-effort: a Dokploy rejection is surfaced as a warning and never fails the reconcile.
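
To make the shape of the orchestrator gate concrete, here is a sketch that builds Docker Swarm service-spec fragments from a resolved health block. The HealthCheck and UpdateConfig field names follow the Docker Engine API, which expresses durations in nanoseconds. Everything else is assumed for illustration: the function name, the stack labels, the exact probe commands, the port argument, and the mapping of retries/intervalMs/timeoutMs onto HealthCheck fields may all differ from what reconcile_project actually sends through Dokploy.

```python
from typing import Optional

NS_PER_MS = 1_000_000  # Docker API durations are in nanoseconds

# Hypothetical stack labels and in-container probe commands.
PROBE_COMMANDS = {
    "node-alpine": "wget -q --spider http://127.0.0.1:{port}{path}",
    "python": (
        "python -c \"import urllib.request as u; "
        "u.urlopen('http://127.0.0.1:{port}{path}', timeout=5)\""
    ),
}


def swarm_health_spec(stack: str, health: dict, port: int) -> Optional[dict]:
    """Return service-spec fragments, or None when the stack is unknown."""
    template = PROBE_COMMANDS.get(stack)
    if template is None:
        # Undetectable stack: skip the orchestrator gate rather than wire a
        # HEALTHCHECK that cannot run inside the image.
        return None
    return {
        "HealthCheck": {
            "Test": ["CMD-SHELL", template.format(port=port, path=health["path"])],
            "Interval": health["intervalMs"] * NS_PER_MS,
            "Timeout": health["timeoutMs"] * NS_PER_MS,
            "Retries": health["retries"],
        },
        "UpdateConfig": {"Order": "start-first", "FailureAction": "rollback"},
    }
```

Returning None for an unknown stack mirrors the note above: the post-deploy probe gate still applies, and only the platform-layer check is skipped.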

Auto-rollback keys off this gate

With "rollback": "auto", a deploy that fails the health gate (retries consecutive failed probes) is automatically reverted. It reverts to the most recent earlier deploy that was recorded healthy, not merely the previous image, which may also be broken. See the rollback guide for the full auto-rollback flow. deploy records each deploy's health verdict so this "last healthy" target selection works.
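
The "last healthy" selection can be sketched as a backwards scan over recorded deploys. The DeployRecord shape and the pick_rollback_target name are hypothetical, and the sketch assumes the history is ordered oldest to newest; the real record format is described in the rollback guide, not here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DeployRecord:
    id: str
    image: str
    healthy: bool  # the health verdict recorded at deploy time


def pick_rollback_target(
    history: Sequence[DeployRecord], failed_id: str
) -> Optional[DeployRecord]:
    """Most recent deploy before `failed_id` that was recorded healthy."""
    ids = [record.id for record in history]
    if failed_id not in ids:
        return None
    earlier = history[: ids.index(failed_id)]
    for record in reversed(earlier):  # newest first
        if record.healthy:
            return record
    return None  # nothing healthy to revert to


if __name__ == "__main__":
    history = [
        DeployRecord("d1", "app:1", healthy=True),
        DeployRecord("d2", "app:2", healthy=False),  # previous image, also broken
        DeployRecord("d3", "app:3", healthy=False),  # the deploy that just failed
    ]
    print(pick_rollback_target(history, "d3"))
```

In this example the scan skips d2, the immediately previous image, because it was never recorded healthy, and lands on d1.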

Writing a real health handler

Node (Hono):

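// Assumes `app` is a Hono instance and `pool` is a pg-style connection pool.
app.get("/healthz", async (c) => {
  try {
    // Proves the DB is reachable. SELECT 1 does not prove migrations ran;
    // query a table your migrations create if you need that guarantee.
    await pool.query("SELECT 1");
    return c.json({ ok: true }); // 200: ready
  } catch {
    return c.json({ ok: false }, 503); // not ready → deploy stays on the old image
  }
});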

Python (FastAPI):

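from fastapi import FastAPI, Response

app = FastAPI()

# The probe sends HEAD. FastAPI's @app.get does not answer HEAD (it returns 405),
# so register both methods explicitly.
@app.api_route("/healthz", methods=["GET", "HEAD"])
def healthz(response: Response) -> dict:
    try:
        # `engine` is assumed to be an existing SQLAlchemy-style engine.
        engine.connect().close()  # opening a connection proves the DB is reachable
        return {"ok": True}
    except Exception:
        response.status_code = 503  # not ready
        return {"ok": False}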
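
When "ready" depends on more than one condition, the handler body can be factored into a framework-free helper that runs several checks and derives the status code. This is a sketch with hypothetical names (run_checks, the check labels); each check is assumed to be a zero-argument callable that raises on failure.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], None]  # raises any exception when the check fails


def run_checks(checks: Dict[str, Check]) -> Tuple[int, dict]:
    """Run every check; return 200 only when all of them pass, else 503."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            # Report only the exception type: /healthz is unauthenticated,
            # so avoid leaking connection strings or paths in messages.
            results[name] = f"failed: {type(exc).__name__}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, {"ok": status == 200, "checks": results}


if __name__ == "__main__":
    import os

    def config_file_exists() -> None:
        os.stat("config.yaml")  # raises FileNotFoundError if missing

    print(run_checks({"config": config_file_exists}))
```

In a FastAPI handler, the returned status would be assigned to response.status_code and the dict returned as the body. Keep each check as cheap as the SELECT 1 above, since the endpoint is hit on every probe attempt.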

What NOT to do

Don't make /healthz heavy. It's hit repeatedly on every deploy and by the orchestrator HEALTHCHECK. Keep checks fast (a SELECT 1, a file stat), not a full integration test.
Don't return 200 unconditionally if you care about readiness. A handler that always returns 200 defeats the gate — the deploy will look healthy even when the DB is down.
Don't put liveness-only logic here and expect readiness gating. One endpoint, readiness semantics: 200 iff the app can actually serve.
Don't probe a path that needs auth or a body. The probe is an unauthenticated HEAD. Keep /healthz open and HEAD-able.
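
A quick way to confirm an endpoint is open and HEAD-able is to send the same kind of request the probe does. The sketch below uses only the Python standard library: it starts a throwaway local server with a stand-in handler and issues an unauthenticated HEAD against it. The handler, the ready flag, and the function names are illustrative; against a real app you would call head_status with your own host and port.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    ready = True  # stand-in for real readiness checks

    def _respond(self, send_body: bool) -> None:
        if self.path != "/healthz":
            status = 404
        else:
            status = 200 if self.ready else 503
        body = b'{"ok": true}' if status == 200 else b'{"ok": false}'
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if send_body:  # HEAD responses carry headers only
            self.wfile.write(body)

    def do_GET(self) -> None:
        self._respond(send_body=True)

    def do_HEAD(self) -> None:
        self._respond(send_body=False)

    def log_message(self, *args) -> None:  # keep the demo quiet
        pass


def head_status(host: str, port: int, path: str, timeout: float = 5.0) -> int:
    """Send an unauthenticated HEAD and return the status code."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


def smoke_test(path: str = "/healthz") -> int:
    """Start the stand-in server on a free port and HEAD the given path."""
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return head_status("127.0.0.1", server.server_address[1], path)
    finally:
        server.shutdown()
        server.server_close()
```

If head_status returns 405 against your app, the route likely answers GET only and would fail the strict gate.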