Next.js · April 17, 2026 · 9 min read

Rolling Back a Broken Next.js Deploy Without Downtime

A broken deploy is inevitable. How quickly and cleanly you recover is an infrastructure design decision. This article covers the rollback strategies available for Next.js — from basic to zero-downtime — and when to use each.

Every Next.js application will at some point receive a broken deploy. The question is not whether rollback will be needed, but how long it takes, how much traffic it affects, and whether the process is reliable enough to execute under pressure at 2 AM. The strategies below are ordered by increasing reliability, with the trade-offs of each.

Why Rollback Is Harder Than It Looks

A Next.js application in production has more state than its own code. A rollback that reverts only the application code while leaving other state in place can produce a system that is broken in a different way than the original failure.

The state that a rollback must account for:

| State type | Examples | Rollback concern |
| --- | --- | --- |
| Application code | Next.js bundle, server code | The obvious target |
| Environment variables | Feature flags, API keys | Old code may expect different keys |
| Database schema | Migrations run by the new deploy | Old code may not work with new schema |
| Cache contents | Full Route Cache, Data Cache | May contain data shaped for new code |
| Client-side cache | Browser Router Cache | Users may have cached the broken version |

A deployment process that does not account for all five can produce silent failures after rollback that are harder to diagnose than the original problem.
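A lightweight way to make those five dimensions rollback-aware is to record them at deploy time. The sketch below is illustrative (the file name and variables are not from any particular pipeline): it writes a small release manifest for each deploy, so the rollback target's image tag, migration version, and environment fingerprint are known before anyone touches production.

```shell
# Record deploy-time state so a rollback has a known-good target.
# IMAGE_TAG and MIGRATION_VERSION are assumed to be exported by the CI job.
IMAGE_TAG="${IMAGE_TAG:-v1.2.3}"
MIGRATION_VERSION="${MIGRATION_VERSION:-unknown}"
# Fingerprint the environment file if present, so env drift is detectable.
ENV_SHA="$(sha256sum .env.production 2>/dev/null | cut -d' ' -f1)"

cat > release-manifest.txt <<EOF
image=${IMAGE_TAG}
migration=${MIGRATION_VERSION}
env_sha=${ENV_SHA}
deployed_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)
EOF
```

A rollback then starts by diffing the current manifest against the target's: if the migration version or env fingerprint differs, reverting the image alone will not be enough.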

Strategy 1 — Manual Redeploy of Previous Version

The simplest rollback: identify the last working image tag or commit, build it again, and deploy it.

Procedure (CI/CD via GitLab CI / GitHub Actions):

# Find the last working tag
git log --oneline -10

# Trigger a deploy of a specific commit
git checkout <commit-hash>
git tag v1.2.3-rollback
git push origin v1.2.3-rollback
# CI pipeline picks up the tag and deploys

What this does well: Works on any infrastructure. No special tooling required.

What this does poorly: Time-to-recovery depends on how long the build takes. If builds take 8 minutes, you have at least 8 minutes of broken production before the rollback is live. Traffic continues hitting the broken version during the build.

When to use: Development environments, staging, or production deployments with very low traffic where 5-10 minutes of downtime is acceptable.

Strategy 2 — Tag-Based Image Rollback

If your CI pipeline produces and pushes Docker images tagged with the commit SHA or version, rollback is a matter of updating the running image without a build step.

# Kubernetes deployment update
kubectl set image deployment/nextjs-app \
  nextjs=registry.example.com/nextjs-app:v1.2.2 \
  --namespace=production

# Watch the rollout
kubectl rollout status deployment/nextjs-app --namespace=production

Kubernetes performs a rolling update by default — it brings up pods running the old image before terminating pods running the broken image, so traffic continues flowing throughout.

Time to recovery: 60-120 seconds, depending on pod startup time and the readinessProbe configuration.

Prerequisite: Images must be retained in the registry. A registry cleanup policy that deletes images older than N days will eventually remove the image you need. Retain at least the last 10 production images by tag.
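Before executing an image rollback, it is worth confirming that the target tag still exists in the registry. A minimal sketch, assuming the `crane` CLI from go-containerregistry (any registry tag-listing API works the same way):

```shell
# tag_present TAGS_FILE TAG — exact-match a tag against a saved tag list.
tag_present() {
  grep -qx "$2" "$1"
}

# In CI or a runbook, fetch the live list first (assumes `crane` is installed):
#   crane ls registry.example.com/nextjs-app > tags.txt
# Then gate the rollback on the target still being in the registry:
#   tag_present tags.txt v1.2.2 || { echo "tag gone, rebuild needed" >&2; exit 1; }
```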

# Kubernetes Deployment — key rollback-related settings
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0      # Never take a pod down before the new one is ready
      maxSurge: 1            # Allow one extra pod during the update
  template:
    spec:
      containers:
        - name: nextjs
          image: registry.example.com/nextjs-app:v1.2.3
          readinessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3

With maxUnavailable: 0, Kubernetes will not terminate any running pod until the replacement passes its readiness probe. A broken deploy that makes the health check fail will halt the rollout — old pods stay up, and the deployment can be rolled back with:

kubectl rollout undo deployment/nextjs-app --namespace=production

This is the single most important Kubernetes configuration for Next.js deployments. Without a readiness probe tied to an actual application health check, Kubernetes considers a pod ready as soon as it starts — even if the application is crashing or returning 500s.
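`kubectl rollout undo` with no arguments steps back exactly one revision. When the last-good revision is further back, `kubectl rollout history` plus `--to-revision` is safer. A small helper, illustrative rather than part of kubectl, that picks the previous revision number out of saved history output:

```shell
# previous_revision HISTORY_FILE — print the revision just before the current
# one from `kubectl rollout history` output (first column, numeric rows only).
previous_revision() {
  awk '$1 ~ /^[0-9]+$/ { revs[n++] = $1 } END { if (n >= 2) print revs[n-2] }' "$1"
}

# Against the cluster:
#   kubectl rollout history deployment/nextjs-app --namespace=production > history.txt
#   kubectl rollout undo deployment/nextjs-app \
#     --to-revision="$(previous_revision history.txt)" --namespace=production
```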

Strategy 3 — Blue/Green Deployment

Blue/green maintains two complete environments. At any given time, one environment (blue) is live and the other (green) is idle. Deployments go to green. Traffic is switched at the load balancer level. Rollback is a traffic switch, not a pod operation.

# Simplified Kubernetes Service switching
apiVersion: v1
kind: Service
metadata:
  name: nextjs-app
spec:
  selector:
    app: nextjs-app
    slot: blue    # Change to "green" to switch traffic
  ports:
    - port: 80
      targetPort: 3000

Switching from green back to blue is a single kubectl patch:

kubectl patch service nextjs-app \
  -p '{"spec":{"selector":{"slot":"blue"}}}' \
  --namespace=production

Time to recovery: Under 10 seconds. The switch is atomic at the load balancer level — no pod restart, no build, no rolling update.

Cost: Requires double the compute to keep both environments running simultaneously.

What this does not solve: Database schema changes. If the new deploy ran a migration that is not backward-compatible with the old code, switching back to blue will produce database errors. This is the primary reason to prefer additive migrations (add a column, keep the old one) over destructive ones (drop-and-recreate) in continuously deployed applications.
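The additive approach is often called expand/contract: ship the additive change with the new code, and run the destructive half only in a later deploy, once no rollback will need the old shape. A sketch with hypothetical table and column names; the commented psql invocation assumes DATABASE_URL is set:

```shell
# Expand step: additive, so both old and new code run against the result.
EXPAND_SQL='ALTER TABLE users ADD COLUMN display_name text;'
# Contract step: destructive, deployed only after the rollback window closes.
CONTRACT_SQL='ALTER TABLE users DROP COLUMN legacy_name;'

# Applied in CI, one step per deploy:
#   psql "$DATABASE_URL" -c "$EXPAND_SQL"
```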

Strategy 4 — Canary Rollback

Canary deployments send a percentage of traffic to the new version while the majority continues on the old version. If the canary shows errors, the traffic percentage is reduced to zero — equivalent to a rollback — without any pod operation on the stable version.

With Kubernetes and an ingress controller that supports traffic splitting (NGINX ingress with nginx.ingress.kubernetes.io/canary annotations, or Argo Rollouts):

# Canary ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nextjs-app-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nextjs-app-canary
                port:
                  number: 80

Setting canary-weight: "0" removes the canary from the traffic pool. The stable deployment continues without interruption.
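The weight can be zeroed without editing YAML by using `kubectl annotate` with `--overwrite`. A sketch, reusing the ingress name from the manifest above:

```shell
# Rolling back a canary is just zeroing its traffic weight; the stable
# Deployment and its pods are never touched.
WEIGHT_OFF='nginx.ingress.kubernetes.io/canary-weight=0'

# Against the cluster:
#   kubectl annotate ingress nextjs-app-canary \
#     "$WEIGHT_OFF" --overwrite --namespace=production
```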

Time to recovery: Under 5 seconds.

When to use: High-traffic production environments where a broken deploy affecting even 1% of users is significant. Canary is the deployment strategy that catches the most regressions before they affect all users.

Health Check Endpoint

All rollback strategies above depend on a reliable health check. The Next.js application must expose an endpoint that returns a non-200 status when it is not ready to serve traffic.

// app/api/health/route.ts
import { NextResponse } from "next/server";
// `db` is the application's database client, e.g. a Prisma client
// exported from a shared module (path is an example, adjust to your project)
import { db } from "@/lib/db";

export async function GET() {
  try {
    // Check database connectivity
    await db.$queryRaw`SELECT 1`;

    return NextResponse.json({ status: "ok" }, { status: 200 });
  } catch (error) {
    return NextResponse.json(
      { status: "error", message: String(error) },
      { status: 503 }
    );
  }
}

A health check that always returns 200 is useless for Kubernetes readiness probes. The probe must verify that the application can actually serve requests — which means checking any critical dependency (database, cache) that would cause user-visible failures if unavailable.
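The same endpoint can be smoke-tested from a shell before it is wired into the probe. A sketch that mirrors what the readiness probe does; `curl -f` treats any status of 400 or above, including the 503 above, as failure:

```shell
# check_health BASE_URL — succeed only if /api/health returns a success status.
check_health() {
  curl -sf --max-time 2 "$1/api/health" > /dev/null
}

# check_health http://localhost:3000 && echo ready || echo not-ready
```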

Choosing a Strategy

| Strategy | Time to recovery | Requires Kubernetes | Zero downtime | Handles schema changes |
| --- | --- | --- | --- | --- |
| Manual redeploy | 5-15 minutes | No | No | Depends |
| Image rollback | 60-120 seconds | Recommended | Yes (with readiness probe) | No |
| Blue/green | Under 10 seconds | Yes | Yes | No |
| Canary | Under 5 seconds | Yes | Yes | Partially (stable pods never change, but migrations against a shared database still apply) |

For most Next.js applications, image rollback on Kubernetes with a properly configured readiness probe covers the majority of failure scenarios. Blue/green or canary is worth the additional complexity for applications deploying multiple times per day or serving high-value traffic where even a 60-second degradation is unacceptable.

Private DevOps configures image-based rollback with readiness probes and kubectl rollout undo as the baseline for every Next.js Kubernetes deployment, with canary deployments available for teams that want sub-10-second recovery. The configuration is documented in runbooks so any engineer on the team can execute a rollback without prior platform knowledge.

Need help with this?

Our team handles this kind of work daily. Let us take care of your infrastructure.