$EngineeringAtlas

Zero-Downtime Deployments Without the Magic

Rolling updates, health checks, and backward-compatible migrations — the unglamorous mechanics of shipping without taking the site down.

Amit Kumar Singh · 2 min read

The Problem

You deploy a new version. For a few seconds, requests fail: connections to the old process are cut before the new one is ready, or the new code expects a database column the migration hasn't added yet. Users see 502s. The deploy "worked," but it wasn't invisible.

Why It Matters

If a deploy causes even thirty seconds of errors, you'll subconsciously deploy less often — batching changes, raising the stakes of each release, and making outages more likely. Zero-downtime deploys are what make continuous delivery psychologically safe.

Core Concepts

Three mechanisms do most of the work:

  1. Rolling updates — replace instances gradually so capacity never drops to zero.
  2. Readiness probes — don't route traffic to a new instance until it says it's ready.
  3. Graceful shutdown — let in-flight requests finish before an old instance exits.

Add backward-compatible migrations and you can deploy code and schema changes independently.

Implementation

A readiness probe gates traffic until the app is actually serving:

readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 5
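The probe is only as honest as the endpoint behind it. A minimal sketch of what `/healthz` could look like on the application side, assuming a plain Node `(req, res)` handler; the `state` object and `markReadyAfter` helper are hypothetical names, not part of any library:

```javascript
// Hypothetical readiness state; these names are illustrative only.
const state = { ready: false, shuttingDown: false };

// Plain (req, res) handler, usable with Node's built-in http.createServer.
function healthz(req, res) {
  const ok = state.ready && !state.shuttingDown;
  // An httpGet probe treats any status from 200 to 399 as success.
  res.statusCode = ok ? 200 : 503;
  res.setHeader("Content-Type", "text/plain");
  res.end(ok ? "ok" : "not ready");
}

// Flip the flag only after startup work (cache warmup, pool setup) has finished.
async function markReadyAfter(startupWork) {
  await startupWork();
  state.ready = true;
}
```

Setting `state.shuttingDown = true` at the start of a SIGTERM handler would also take the instance out of rotation before it stops serving, provided the probe period is shorter than the drain window.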

Graceful shutdown drains in-flight work on SIGTERM:

process.on("SIGTERM", async () => {
  server.close();              // stop accepting new connections
  await drainInFlight();       // app-defined helper: wait for active requests to finish
  await db.end();              // close pools cleanly
  process.exit(0);
});
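The handler above calls `drainInFlight()`, which is not a Node built-in. One possible shape for it, as a sketch: count requests as they start, decrement when each response closes, and resolve once the count reaches zero. The `track` wrapper and the 10-second default timeout are assumptions for illustration:

```javascript
// Hypothetical in-flight tracker; one way drainInFlight() could be implemented.
let inFlight = 0;
let onIdle = null; // set while a drain is waiting for the count to reach zero

// Wrap a (req, res) handler so each request is counted until its response closes.
function track(handler) {
  return (req, res) => {
    inFlight++;
    res.once("close", () => {
      inFlight--;
      if (inFlight === 0 && onIdle) {
        const done = onIdle;
        onIdle = null;
        done();
      }
    });
    return handler(req, res);
  };
}

// Resolves with "drained" when no requests are active,
// or with "timeout" after timeoutMs as a safety net.
function drainInFlight(timeoutMs = 10_000) {
  if (inFlight === 0) return Promise.resolve("drained");
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve("timeout"), timeoutMs);
    onIdle = () => {
      clearTimeout(timer);
      resolve("drained");
    };
  });
}
```

Keep the timeout below the orchestrator's termination grace period, so the process exits on its own terms rather than being killed.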

Common Mistakes

  • No readiness gate. Traffic hits an instance still warming caches or running migrations, and those first requests fail.
  • Destructive migrations shipped with code that needs them. Dropping a column in the same release that stops using it means the old, still-running version breaks.
  • Ignoring SIGTERM. When the grace period expires, the orchestrator kills the process with SIGKILL, cutting off any requests still in flight.

Production Considerations

Use the expand–contract pattern for schema changes. To rename a column:

  1. Expand — add the new column; write to both, read from the old.
  2. Migrate — backfill and switch reads to the new column.
  3. Contract — once no running code references the old column, drop it.

Each step is backward-compatible, so old and new code coexist during the rollout.
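The application side of those steps can be sketched roughly as follows. The column names `fullname` and `display_name` are hypothetical, and a plain object stands in for a database row so the example runs on its own:

```javascript
// Hypothetical rename of users.fullname to users.display_name.

// Expand: write both columns so either release can read what the other wrote.
function writeName(row, name) {
  row.fullname = name;      // old column, still read by the previous release
  row.display_name = name;  // new column, added by the expand migration
  return row;
}

// Migrate: backfill rows that were written before dual-writes began.
function backfill(rows) {
  for (const row of rows) {
    if (row.display_name == null) row.display_name = row.fullname;
  }
  return rows;
}

// Migrate: reads prefer the new column, falling back until the backfill is done.
function readName(row) {
  return row.display_name ?? row.fullname;
}
```

Once every row is backfilled and no release writes or reads `fullname`, the contract step can drop the column and the fallback in `readName` can go with it.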

Security

Keep health endpoints free of secrets and cheap to call — they're hit constantly and often unauthenticated. A readiness probe that runs an expensive query becomes a self-inflicted denial of service.
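If readiness really does depend on something costly, one option is to cache the result for a short time so probe traffic cannot multiply the cost. A minimal sketch, assuming a hypothetical `cachedCheck` wrapper; the injectable `now` clock exists only to make the behaviour easy to test:

```javascript
// Wrap an expensive check so it runs at most once per ttlMs,
// no matter how often the endpoint is probed.
function cachedCheck(check, ttlMs, now = Date.now) {
  let last = null; // { at, result } from the most recent run
  return async function probe() {
    const t = now();
    if (last === null || t - last.at >= ttlMs) {
      // Store the promise itself so concurrent probes share a single call.
      last = { at: t, result: Promise.resolve(check()) };
    }
    return last.result;
  };
}
```

A handler would then await the wrapped function (for example one built around a database ping) and return 200 or 503 based on the result, keeping the response body to a bare status string.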

Performance

Set maxUnavailable to 0 and maxSurge to at least 1 so capacity stays at or above 100% during the roll. Surging one extra instance at a time is slower than a larger surge but never dips below your serving capacity.
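To sanity-check a pair of settings before applying them, the capacity bounds can be worked out by hand. The sketch below assumes Kubernetes Deployment semantics, where percentage values round up for maxSurge and down for maxUnavailable; the function names are hypothetical:

```javascript
// Resolve a setting that is either an absolute number or a percentage string like "25%".
function resolveValue(value, replicas, roundUp) {
  if (typeof value === "number") return value;
  // parseFloat reads the leading number and ignores the trailing "%".
  const exact = (replicas * parseFloat(value)) / 100;
  return roundUp ? Math.ceil(exact) : Math.floor(exact);
}

// Lowest available count and highest total count during a rolling update.
function rolloutBounds(replicas, maxSurge, maxUnavailable) {
  const surge = resolveValue(maxSurge, replicas, true);              // rounds up
  const unavailable = resolveValue(maxUnavailable, replicas, false); // rounds down
  return {
    minAvailable: replicas - unavailable,
    maxTotal: replicas + surge,
  };
}
```

With 4 replicas, `rolloutBounds(4, 1, 0)` gives a floor of 4 available instances and a ceiling of 5 in total, which is the "never dip below 100%" configuration described above.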

Summary

Zero-downtime deploys aren't magic — they're rolling updates, honest readiness probes, graceful shutdown, and migrations designed so two versions can run at once. Get those right and shipping becomes a non-event, which is exactly what you want.

Written by Amit Kumar Singh, a software engineer writing about backend systems, cloud, and the realities of running code in production.
