Zero-Downtime Deployments Without the Magic

The Problem

You deploy a new version. For a few seconds, requests fail: connections to the old process are cut before the new one is ready, or the new code expects a database column the migration hasn't added yet. Users see 502s. The deploy "worked," but it wasn't invisible.

Why It Matters

If a deploy causes even thirty seconds of errors, you'll subconsciously deploy less often — batching changes, raising the stakes of each release, and making outages more likely. Zero-downtime deploys are what make continuous delivery psychologically safe.

Core Concepts

Three mechanisms do most of the work:

Rolling updates — replace instances gradually so capacity never drops to zero.
Readiness probes — don't route traffic to a new instance until it says it's ready.
Graceful shutdown — let in-flight requests finish before an old instance exits.

Add backward-compatible migrations and you can deploy code and schema changes independently.

Implementation

A readiness probe gates traffic until the app is actually serving:

readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 5
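
On the application side, the probe needs an endpoint that fails until startup work is finished. The following is a minimal sketch, assuming a Node.js HTTP server and a Kubernetes-style probe that treats any status from 200 to 399 as success. The names `markReady`, `markNotReady`, and `handleHealthz` are illustrative and do not come from any framework.

```javascript
// Readiness state: false until startup work (cache warmup, connections) completes.
let ready = false;

function markReady() { ready = true; }
function markNotReady() { ready = false; }

// Handler for GET /healthz. Expects the (req, res) pair that Node's http module
// passes to request listeners; only res.writeHead and res.end are used.
function handleHealthz(req, res) {
  const status = ready ? 200 : 503;   // 503 keeps the instance out of rotation
  res.writeHead(status, { "Content-Type": "text/plain" });
  res.end(ready ? "ok" : "not ready");
  return status;                      // returned only to make the handler easy to test
}
```

One way to use it is to call `markReady()` once startup has finished, and `markNotReady()` at the start of shutdown so the instance leaves rotation before it stops serving.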

Graceful shutdown drains in-flight work on SIGTERM:

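// `server`, `db`, and `drainInFlight` are defined elsewhere in the app.
process.on("SIGTERM", async () => {
  try {
    server.close();            // stop accepting new connections
    await drainInFlight();     // let active requests finish
    await db.end();            // close pools cleanly
    process.exit(0);
  } catch (err) {
    console.error("shutdown failed:", err);
    process.exit(1);           // non-zero exit makes the failed shutdown visible
  }
});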
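
The handler above calls `drainInFlight()` without showing it. One possible shape, offered as a sketch rather than the article's actual helper, is a counter of active requests plus a promise that resolves when the counter reaches zero. `trackRequest` and the default 10-second timeout are assumptions made for illustration.

```javascript
let inFlight = 0;     // requests that have started but not finished
let waiters = [];     // callbacks for pending drainInFlight() calls

// Call when a request starts; returns a function to call when it finishes.
function trackRequest() {
  inFlight += 1;
  let done = false;
  return function finish() {
    if (done) return;           // guard against counting one request twice
    done = true;
    inFlight -= 1;
    if (inFlight === 0) {
      waiters.forEach((notify) => notify());
      waiters = [];
    }
  };
}

// Resolves with "drained" once nothing is in flight, or "timeout" after timeoutMs.
function drainInFlight(timeoutMs = 10000) {
  if (inFlight === 0) return Promise.resolve("drained");
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve("timeout"), timeoutMs);
    waiters.push(() => {
      clearTimeout(timer);
      resolve("drained");
    });
  });
}
```

In a Node HTTP server, a request listener could call `trackRequest()` on entry and pass the returned function to `res.on("close", ...)`. The timeout should stay shorter than the orchestrator's termination grace period, so the process exits on its own terms.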

Common Mistakes

No readiness gate. Traffic hits an instance still warming caches or running migrations, and those first requests fail.
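Destructive migrations shipped in the same release as the code change. Dropping a column in the release that stops using it breaks the old version, which is still running during the rollout and still queries that column.
Ignoring SIGTERM. If the process doesn't shut itself down, the orchestrator force-kills it when the grace period expires, cutting off any requests still in flight.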

Production Considerations

Use the expand–contract pattern for schema changes. To rename a column:

Expand — add the new column; write to both, read from the old.
Migrate — backfill and switch reads to the new column.
Contract — once no running code references the old column, drop it.

Each step is backward-compatible, so old and new code coexist during the rollout.
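
The application side of those three phases can be sketched as follows, assuming a hypothetical rename of `users.fullname` to `users.display_name`. The column names, the phase strings, and both function names are invented for illustration. The sketch also assumes dual writes continue through the migrate phase, so that instances still reading the old column keep seeing fresh data.

```javascript
// Hypothetical rename: users.fullname -> users.display_name.

// Columns the application writes in each phase.
function columnsToWrite(phase, name) {
  switch (phase) {
    case "expand":
    case "migrate":
      return { fullname: name, display_name: name };  // dual-write to old and new
    case "contract":
      return { display_name: name };                  // old column is no longer written
    default:
      throw new Error(`unknown phase: ${phase}`);
  }
}

// Column the application reads in each phase.
function readName(phase, row) {
  // Reads stay on the old column until the backfill is done (migrate phase).
  return phase === "expand" ? row.fullname : row.display_name;
}
```

Because the phase is an explicit input, each rollout changes only one thing: which columns are written, or which column is read.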

Security

Keep health endpoints free of secrets and cheap to call — they're hit constantly and often unauthenticated. A readiness probe that runs an expensive query becomes a self-inflicted denial of service.
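
One way to keep the endpoint cheap is to cache the result of the underlying check for a short interval, so a burst of probes triggers at most one real check. This is a sketch under simplifying assumptions: `cachedCheck` is a hypothetical helper, and the check is synchronous here, whereas a real dependency check would likely be asynchronous.

```javascript
// Wraps a check so it runs at most once per ttlMs; calls in between reuse the last result.
// `check` is any function returning true/false; `now` is injectable so tests can control time.
function cachedCheck(check, ttlMs, now = Date.now) {
  let lastRun = -Infinity;   // forces the first call to run the check
  let lastResult = false;
  return function () {
    const t = now();
    if (t - lastRun >= ttlMs) {
      lastResult = check();
      lastRun = t;
    }
    return lastResult;
  };
}
```

The TTL should be shorter than the probe period. Otherwise a probe can be answered from a result that predates the previous probe.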

Performance

Set maxUnavailable to 0 and maxSurge to at least 1 so capacity stays at or above 100% during the roll. Surging one extra instance at a time is slower, but serving capacity never dips below the baseline.
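
The effect of the two settings is simple arithmetic, sketched below. It assumes a Kubernetes Deployment, which is where these field names come from, and follows the documented rounding there: a percentage for maxSurge rounds up and a percentage for maxUnavailable rounds down. `rolloutBounds` and `resolveCount` are illustrative names.

```javascript
// Resolve an absolute count or a percentage string such as "25%" against the replica count.
function resolveCount(value, replicas, roundUp) {
  if (typeof value === "number") return value;
  const exact = (parseFloat(value) * replicas) / 100;   // "25%" of 10 -> 2.5
  return roundUp ? Math.ceil(exact) : Math.floor(exact);
}

// Fewest instances that stay available, and most instances that exist, during a roll.
function rolloutBounds(replicas, maxSurge, maxUnavailable) {
  const surge = resolveCount(maxSurge, replicas, true);              // rounds up
  const unavailable = resolveCount(maxUnavailable, replicas, false); // rounds down
  return {
    minAvailable: replicas - unavailable,
    maxTotal: replicas + surge,
  };
}
```

For example, `rolloutBounds(4, 1, 0)` gives `minAvailable: 4` and `maxTotal: 5`, so capacity never drops below the four baseline instances, at the cost of briefly running five.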

Summary

Zero-downtime deploys aren't magic — they're rolling updates, honest readiness probes, graceful shutdown, and migrations designed so two versions can run at once. Get those right and shipping becomes a non-event, which is exactly what you want.

Amit Kumar Singh
