Next.js · March 22, 2026 · 12 min read

How We Run Next.js at Scale on K3s with Zero Downtime

A production-grade guide to running Next.js on K3s with zero downtime — container registry, CI/CD pipelines, rolling updates, Cloudflare CDN and Tunnel, Prometheus monitoring, and automated cache purging.

Introduction

We manage dozens of Next.js applications for our clients across K3s clusters. Over time, we have built a deployment pipeline that is fast, reliable, and achieves true zero-downtime deployments. This article describes our production setup: from code push to live traffic, fully automated through CI/CD.

This is not a theoretical guide — this is what we run in production today.

Architecture Overview

Developer pushes to main branch
    |
    v
CI/CD pipeline triggers (GitLab CI / GitHub Actions)
    |
    v
Docker build + push to Container Registry (GitLab Registry / ECR / GHCR)
    |
    v
kubectl set image → rolling update on K3s
    |
    v
Cloudflare cache purge (automatic)
    |
    v
Traffic: Cloudflare CDN → Tunnel → K3s Service → Next.js pods

1. CI/CD Pipeline

We use GitLab CI (works the same with GitHub Actions) to build and push images to a container registry. No manual steps — push to main and the pipeline handles everything.

GitLab CI Example (.gitlab-ci.yml)

stages:
  - build
  - deploy

variables:
  IMAGE: $CI_REGISTRY_IMAGE/app
  TAG: $CI_COMMIT_SHORT_SHA

build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    # Pull the previous image so --cache-from has layers to reuse on this runner
    - docker pull $IMAGE:latest || true
    - docker build
        --cache-from $IMAGE:latest
        -t $IMAGE:$TAG
        -t $IMAGE:latest
        -f Dockerfile .
    - docker push $IMAGE:$TAG
    - docker push $IMAGE:latest
  only:
    - main

deploy:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - kubectl set image deployment/myapp myapp=$IMAGE:$TAG -n myapp
    - kubectl rollout status deployment/myapp -n myapp --timeout=300s
    - |
      curl -s -X POST \
        "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/purge_cache" \
        -H "Authorization: Bearer $CF_API_TOKEN" \
        -H "Content-Type: application/json" \
        --data '{"purge_everything":true}'
  only:
    - main

GitHub Actions Alternative

name: Deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}/app:${{ github.sha }}
      - uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}
      - run: |
          kubectl set image deployment/myapp \
            myapp=ghcr.io/${{ github.repository }}/app:${{ github.sha }} \
            -n myapp
          kubectl rollout status deployment/myapp -n myapp --timeout=300s

Why Container Registry?

  • Immutable tags — every deployment is traceable to a specific commit SHA
  • Rollback is instant — kubectl set image to a previous tag, no rebuild needed
  • Multi-node clusters — all nodes pull from the registry, no manual image distribution
  • CI/CD native — GitLab, GitHub, ECR all have built-in registries
  • Cache layers — subsequent builds are fast thanks to Docker layer caching
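
Rollback in practice is a single command. A sketch, assuming the deployment, container name, and namespace used throughout this article, and a hypothetical previous tag abc1234:

```shell
# Roll back by pointing the deployment at a previous, known-good tag.
# abc1234 is the short commit SHA of the release you want to restore.
kubectl set image deployment/myapp myapp=$CI_REGISTRY_IMAGE/app:abc1234 -n myapp
kubectl rollout status deployment/myapp -n myapp --timeout=300s

# Alternatively, let Kubernetes restore the previous ReplicaSet directly:
kubectl rollout undo deployment/myapp -n myapp
```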

2. K3s Cluster Configuration

Our K3s setup is deliberately simple:

# Install K3s without the default ingress controller
curl -sfL https://get.k3s.io | sh -s - \
  --disable traefik \
  --disable servicelb \
  --write-kubeconfig-mode 644

We disable Traefik and the default service load balancer because all traffic arrives through Cloudflare Tunnel. There is no need for an ingress controller or external load balancer.

Deployment Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: myapp
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: myapp
          image: registry.example.com/myapp:latest
          ports:
            - containerPort: 3000
          env:
            - name: NODE_ENV
              value: "production"
            - name: NODE_OPTIONS
              value: "--max-old-space-size=384"
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 30
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]

3. Cloudflare Integration

Cloudflare Tunnel

We run cloudflared as a Kubernetes deployment with 2 replicas for redundancy:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudflared
  namespace: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cloudflared
  template:
    metadata:
      labels:
        app: cloudflared
    spec:
      containers:
        - name: cloudflared
          image: cloudflare/cloudflared:latest
          args:
            - tunnel
            - --no-autoupdate
            - run
            - --token
            - $(TUNNEL_TOKEN)
          env:
            - name: TUNNEL_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cf-tunnel-token
                  key: token
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 128Mi
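
Traffic from the tunnel reaches the pods through a plain ClusterIP Service. A sketch, matching the labels and port from the manifests above (the tunnel's public hostname is then mapped to this Service in the Cloudflare dashboard, e.g. http://myapp.myapp.svc.cluster.local:3000):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: myapp
spec:
  selector:
    app: myapp
  ports:
    - name: http
      port: 3000
      targetPort: 3000
```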

CDN Caching Rules

We configure Cloudflare to cache static assets aggressively but bypass the cache for API routes and dynamic pages:

  • /_next/static/*: Cache for 1 year (immutable hashed filenames).
  • /images/*, /fonts/*: Cache for 30 days.
  • /api/*: Bypass cache.
  • Everything else: Cache for 1 hour with stale-while-revalidate.
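
The same policy can be expressed as origin Cache-Control headers, which Cloudflare can honor when Origin Cache Control is enabled for the matching cache rule. A sketch of a next.config.ts implementing the rules above (paths and TTLs mirror the list; adjust to your asset layout):

```typescript
// next.config.ts — origin Cache-Control headers mirroring the CDN rules.
const nextConfig = {
  async headers() {
    return [
      {
        // Hashed build assets: immutable, cache for 1 year
        source: "/_next/static/:path*",
        headers: [
          { key: "Cache-Control", value: "public, max-age=31536000, immutable" },
        ],
      },
      {
        // Images and fonts: cache for 30 days
        source: "/:prefix(images|fonts)/:path*",
        headers: [{ key: "Cache-Control", value: "public, max-age=2592000" }],
      },
      {
        // API routes: never cache
        source: "/api/:path*",
        headers: [{ key: "Cache-Control", value: "no-store" }],
      },
      {
        // Everything else: 1 hour, with stale-while-revalidate
        source: "/:path*",
        headers: [
          {
            key: "Cache-Control",
            value: "public, max-age=3600, stale-while-revalidate=86400",
          },
        ],
      },
    ];
  },
};

export default nextConfig;
```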

Automatic Cache Purge After Deploy

The deploy script purges the entire Cloudflare cache after a successful rollout. For targeted purging, we can purge by URL prefix instead (note that purge by prefix requires a Cloudflare Enterprise plan):

curl -s -X POST \
  "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/purge_cache" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"prefixes":["myapp.example.com/_next/"]}'

4. Monitoring with Prometheus

We run Prometheus and Grafana on the K3s cluster to monitor application and cluster health.

Key Metrics We Track

  • Request latency (P50, P95, P99) via Node.js metrics
  • Pod CPU and memory usage via kube-state-metrics
  • Pod restart count (indicates OOM kills or crash loops)
  • HTTP error rates (5xx responses)
  • Deployment rollout duration
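
For the latency percentiles, a typical PromQL query looks like the sketch below. It assumes the application exports an http_request_duration_seconds histogram (e.g. via prom-client middleware); a plain counter is not enough for percentiles:

```promql
# P95 request latency over the last 5 minutes
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
```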

ServiceMonitor for Next.js

We expose metrics from the Next.js application using a custom endpoint:

// app/api/metrics/route.ts
// NOTE: this module-level counter is per-pod and resets on restart; as
// written it only increments when the endpoint itself is hit. For real
// request counting, instrument middleware (e.g. with prom-client).
import { NextResponse } from "next/server";

let requestCount = 0;

export async function GET() {
  requestCount++;
  const memUsage = process.memoryUsage();
  const metrics = [
    `# HELP nodejs_heap_used_bytes Node.js heap used`,
    `# TYPE nodejs_heap_used_bytes gauge`,
    `nodejs_heap_used_bytes ${memUsage.heapUsed}`,
    `# HELP nodejs_heap_total_bytes Node.js heap total`,
    `# TYPE nodejs_heap_total_bytes gauge`,
    `nodejs_heap_total_bytes ${memUsage.heapTotal}`,
    `# HELP http_requests_total Total HTTP requests`,
    `# TYPE http_requests_total counter`,
    `http_requests_total ${requestCount}`,
  ].join("\n");

  return new NextResponse(metrics, {
    headers: { "Content-Type": "text/plain" },
  });
}
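
Prometheus discovers this endpoint through a ServiceMonitor. A sketch — it assumes the Prometheus Operator is installed and that a ClusterIP Service for the app exposes a port named http:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: myapp
  labels:
    release: prometheus   # must match the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: http
      path: /api/metrics
      interval: 30s
```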

Alerting Rules

groups:
  - name: nextjs-alerts
    rules:
      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes{container="myapp"} / container_spec_memory_limit_bytes{container="myapp"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Next.js pod memory above 85%"

      - alert: PodRestartLoop
        expr: increase(kube_pod_container_status_restarts_total{container="myapp"}[1h]) > 3
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Next.js pod restarting frequently"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"

5. Results

With this setup, our deployments typically complete in under 3 minutes from git push to live traffic:

  • CI build + push: 60-120 seconds (cached layers)
  • Rolling update: 30-60 seconds
  • Cache purge: 2 seconds
  • Total: Under 3 minutes, fully automated

Zero requests are dropped during the rolling update thanks to the readiness probe, preStop hook, and maxUnavailable: 0 configuration. Rollbacks are instant — just point to a previous image tag.

This pipeline has been serving us and our clients reliably for over a year. The key principles: container registry for immutable, traceable deployments; K3s for lightweight Kubernetes; Cloudflare for CDN and secure tunneling; Prometheus for observability. Every deployment is automated, every image is tagged with its commit SHA, and every rollback is one command away.
