Zero-Downtime Deployments Without the Magic
Rolling updates, health checks, and backward-compatible migrations — the unglamorous mechanics of shipping without taking the site down.
The Problem
You deploy a new version. For a few seconds, requests fail: connections to the old process are cut before the new one is ready, or the new code expects a database column the migration hasn't added yet. Users see 502s. The deploy "worked," but it wasn't invisible.
Why It Matters
If a deploy causes even thirty seconds of errors, you'll subconsciously deploy less often — batching changes, raising the stakes of each release, and making outages more likely. Zero-downtime deploys are what make continuous delivery psychologically safe.
Core Concepts
Three mechanisms do most of the work:
- Rolling updates — replace instances gradually so capacity never drops to zero.
- Readiness probes — don't route traffic to a new instance until it says it's ready.
- Graceful shutdown — let in-flight requests finish before an old instance exits.
Add backward-compatible migrations and you can deploy code and schema changes independently.
Implementation
A readiness probe gates traffic until the app is actually serving:
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 5
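On the application side, the probe needs an endpoint that reports failure until the app is genuinely ready. The following is a minimal sketch using Node's built-in `http` module, not a definitive implementation: the `ready` flag, `healthStatus`, `warmUp`, and `start` are hypothetical names introduced for illustration, and the 503-until-ready convention is an assumption about how the app signals readiness.

```javascript
const http = require("node:http");

// Assumption: readiness is a simple in-process flag flipped after warm-up.
let ready = false;

// Pure status logic, kept separate so it can be checked without a socket.
function healthStatus(isReady) {
  return isReady
    ? { code: 200, body: "ok" }
    : { code: 503, body: "starting" };
}

// Hypothetical warm-up step: load caches, open connection pools, and so on.
async function warmUp() {}

const server = http.createServer((req, res) => {
  if (req.url === "/healthz") {
    const { code, body } = healthStatus(ready);
    res.writeHead(code, { "Content-Type": "text/plain" });
    res.end(body);
    return;
  }
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("hello\n");
});

// Listen first so the probe gets an explicit 503, then flip the flag.
async function start(port = 8080) {
  await new Promise((resolve) => server.listen(port, resolve));
  await warmUp();
  ready = true;
}
```

Calling `start()` binds port 8080 to match the probe configuration. Until `warmUp` resolves, `/healthz` answers 503, which an HTTP probe treats as a failure, so the instance receives no traffic.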
Graceful shutdown drains in-flight work on SIGTERM:
process.on("SIGTERM", async () => {
  server.close();        // stop accepting new connections
  await drainInFlight(); // let active requests finish
  await db.end();        // close pools cleanly
  process.exit(0);
});
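The handler above leaves `drainInFlight` undefined. One way it might be built is a small counter that tracks active requests and resolves once the count reaches zero. This is a sketch under assumptions: `createTracker`, `begin`, `count`, and `drain` are hypothetical names, and the 10-second default timeout is an arbitrary choice that should stay below the orchestrator's termination grace period.

```javascript
// Hypothetical in-flight tracker; none of these names are Node built-ins.
function createTracker() {
  let active = 0;
  let waiters = [];

  return {
    // Call when a request starts; returns a function to call when it ends.
    begin() {
      active += 1;
      let finished = false;
      return () => {
        if (finished) return; // guard against double-finish
        finished = true;
        active -= 1;
        if (active === 0) {
          const toNotify = waiters;
          waiters = [];
          toNotify.forEach((notify) => notify());
        }
      };
    },

    // Number of requests currently in flight.
    count() {
      return active;
    },

    // Resolves once nothing is in flight; rejects after timeoutMs.
    drain(timeoutMs = 10000) {
      if (active === 0) return Promise.resolve();
      return new Promise((resolve, reject) => {
        const timer = setTimeout(
          () => reject(new Error("drain timed out")),
          timeoutMs
        );
        waiters.push(() => {
          clearTimeout(timer);
          resolve();
        });
      });
    },
  };
}
```

In a request handler this could be wired as `const end = tracker.begin(); res.on("close", end);`, with `drainInFlight` defined as `() => tracker.drain()`. The timeout keeps one stuck request from blocking shutdown indefinitely.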
Common Mistakes
- No readiness gate. Traffic hits an instance still warming caches or running migrations, and those first requests fail.
- Destructive migrations shipped in the same release as the code change. Dropping a column in the release that stops using it breaks the old version, which is still running during the rollout and still reads that column.
- Ignoring SIGTERM. The orchestrator kills the process mid-request.
Production Considerations
Use the expand–contract pattern for schema changes. To rename a column:
- Expand — add the new column; write to both, read from the old.
- Migrate — backfill and switch reads to the new column.
- Contract — once no running code references the old column, drop it.
Each step is backward-compatible, so old and new code coexist during the rollout.
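The three phases can be sketched as application-level rules for which columns to write and which to read. This is an illustration only: the rename from `username` to `handle` is a hypothetical example, and the helper names are not from any library.

```javascript
// Hypothetical rename: `username` (old column) becomes `handle` (new column).
const PHASES = ["expand", "migrate", "contract"];

function assertPhase(phase) {
  if (!PHASES.includes(phase)) throw new Error(`unknown phase: ${phase}`);
}

// Columns every write must populate in a given phase.
function columnsToWrite(phase) {
  assertPhase(phase);
  // After contract the old column is gone, so only the new one is written.
  if (phase === "contract") return ["handle"];
  // Dual-write while code that reads the old column may still be running.
  return ["username", "handle"];
}

// Column reads should use in a given phase.
function columnToRead(phase) {
  assertPhase(phase);
  return phase === "expand" ? "username" : "handle";
}

// Build the column/value map for an insert or update.
function buildWrite(phase, value) {
  const row = {};
  for (const column of columnsToWrite(phase)) row[column] = value;
  return row;
}
```

For example, `buildWrite("expand", "ada")` populates both columns, so a rollback to code that reads only `username` still sees current data.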
Security
Keep health endpoints free of secrets and cheap to call — they're hit constantly and often unauthenticated. A readiness probe that runs an expensive query becomes a self-inflicted denial of service.
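One way to keep a probe cheap is to cache the result of the underlying check for a short interval, so frequent probes do not each trigger the expensive work. The sketch below assumes a synchronous `check` function for brevity; `cachedCheck`, `ttlMs`, and the injectable `now` clock are hypothetical names, and a real dependency check would more likely be asynchronous and refreshed in the background.

```javascript
// Wraps an expensive check so it runs at most once per ttlMs.
// `check` and `now` are supplied by the caller; nothing here is a framework API.
function cachedCheck(check, ttlMs, now = Date.now) {
  let lastAt = -Infinity; // forces the first call to run the check
  let lastResult = false;

  return () => {
    const current = now();
    if (current - lastAt >= ttlMs) {
      lastResult = Boolean(check());
      lastAt = current;
    }
    return lastResult;
  };
}
```

With a probe period of 5 seconds, a TTL of a few seconds means each instance runs the real check roughly once per probe, no matter how many other callers hit the endpoint.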
Performance
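Set maxSurge and maxUnavailable so capacity never drops below 100% during the roll. With maxUnavailable set to 0 and maxSurge set to 1, each new instance must become ready before an old one is removed. Surging one extra instance at a time makes the rollout slower, but serving capacity never dips.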
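The effect of the two settings is simple arithmetic, sketched below. The function names are hypothetical. The rounding rule (percentages round up for maxSurge and down for maxUnavailable) follows Kubernetes Deployment behavior, which is an assumption about the orchestrator in use.

```javascript
// Resolve an absolute count or a percentage string such as "25%".
function resolveCount(value, replicas, roundUp) {
  if (typeof value === "number") return value;
  // parseFloat stops at the "%" sign, so "25%" parses as 25.
  const raw = (replicas * parseFloat(value)) / 100;
  return roundUp ? Math.ceil(raw) : Math.floor(raw);
}

// Lowest available count and highest total count during a rolling update.
function rolloutBounds(replicas, maxSurge, maxUnavailable) {
  const surge = resolveCount(maxSurge, replicas, true);
  const unavailable = resolveCount(maxUnavailable, replicas, false);
  return {
    minAvailable: replicas - unavailable,
    maxTotal: replicas + surge,
  };
}
```

With 4 replicas, `rolloutBounds(4, 1, 0)` gives a floor of 4 available instances and a ceiling of 5 total, so capacity holds at 100% throughout. With 10 replicas and `"25%"` for both settings, the floor is 8 and the ceiling is 13.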
Summary
Zero-downtime deploys aren't magic — they're rolling updates, honest readiness probes, graceful shutdown, and migrations designed so two versions can run at once. Get those right and shipping becomes a non-event, which is exactly what you want.