Server & DevOps · October 8, 2025 · 8 min read

Kubernetes HPA Deep Dive: Scaling Under Load

Configure Kubernetes Horizontal Pod Autoscaler with custom metrics, scaling policies, and behavior tuning for predictable production auto-scaling.

Beyond Basic CPU Autoscaling

The Horizontal Pod Autoscaler (HPA) is one of Kubernetes' most powerful features, yet most deployments use only the simplest configuration — scale when CPU exceeds 50 percent. In production, CPU-based scaling often reacts too slowly or too aggressively, leading to either request queuing or resource waste.

This guide covers advanced HPA configuration including custom metrics, scaling behaviors, and stabilization windows that we implement as part of our Kubernetes management practice.

How HPA Works Internally

The HPA controller runs a control loop every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period). On each iteration it queries the metrics API and calculates the desired replica count using the formula:

desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))

The controller then applies scaling policies, stabilization windows, and tolerance thresholds before issuing a scale command.
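Plugging illustrative numbers into the formula makes the behavior concrete. A sketch with assumed values (4 replicas averaging 90 percent CPU against a 65 percent target):

```shell
# Worked example of the HPA formula with assumed values.
# ceil(a/b) via integer math: (a + b - 1) / b
current_replicas=4
current_metric=90    # observed average utilization (%)
desired_metric=65    # target utilization (%)
desired_replicas=$(( (current_replicas * current_metric + desired_metric - 1) / desired_metric ))
echo "$desired_replicas"   # ceil(4 * 90 / 65) = 6
```

The ceiling matters: the controller always rounds up, so even a small overshoot of the target adds at least one pod.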

The Tolerance Band

By default, HPA has a 10 percent tolerance (--horizontal-pod-autoscaler-tolerance=0.1). If the ratio of current to desired metric value falls within 0.9 to 1.1, no scaling action occurs. This prevents oscillation but can delay response to genuine load changes.
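A quick sketch of the tolerance check, with assumed numbers, shows why a metric slightly over target triggers no scaling:

```shell
# Tolerance-band sketch (illustrative values): scale only if the
# current/desired ratio leaves the default 0.9-1.1 band.
current=68; desired=65                    # e.g. 68% observed vs a 65% target
ratio_pct=$(( 100 * current / desired ))  # integer ratio in percent
if [ "$ratio_pct" -lt 90 ] || [ "$ratio_pct" -gt 110 ]; then
  echo "scale"
else
  echo "within tolerance, no action"
fi
```

Here the ratio is about 1.04, inside the band, so the HPA holds steady even though utilization exceeds the target.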

Multi-Metric HPA Configuration

Production services should scale based on multiple signals:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 5
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
      selectPolicy: Min

Understanding the Behavior Section

Scale-up uses a 30-second stabilization window and two policies: add 50 percent of current replicas or 5 pods, whichever is greater (selectPolicy: Max). This allows aggressive scale-up during traffic spikes.

Scale-down uses a 300-second (5-minute) stabilization window and a conservative policy: remove at most 10 percent of replicas every 2 minutes (selectPolicy: Min). This prevents the cluster from scaling down too quickly after a traffic spike, which would cause problems if traffic returns.
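With illustrative numbers, the scale-up policies above combine like this (a sketch of the selection logic, not controller code):

```shell
# selectPolicy: Max sketch with assumed current replica count.
replicas=20
by_percent=$(( replicas * 50 / 100 ))   # Percent policy: 50% of current replicas
by_pods=5                               # Pods policy: flat 5 pods
allowed=$(( by_percent > by_pods ? by_percent : by_pods ))
echo "$allowed"   # up to 10 pods may be added in this 60-second period
```

At small replica counts the flat Pods policy dominates; past 10 replicas the Percent policy takes over, which is exactly why pairing the two gives sane behavior at both ends of the range.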

Custom Metrics with Prometheus

The built-in CPU and memory metrics are limited. Real application health is better measured by request latency, queue depth, or error rates.

Setting Up the Metrics Pipeline

Install the Prometheus Adapter to expose custom metrics to the HPA:

# prometheus-adapter-config.yaml
rules:
  - seriesQuery: 'http_request_duration_seconds_count{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_count$"
      as: "http_requests_per_second"
    metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'

  - seriesQuery: 'rabbitmq_queue_messages{namespace!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
    name:
      as: "queue_depth"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>})'

Verifying Custom Metrics

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .

Scaling Based on Queue Depth

For worker deployments that process jobs from a message queue, scale based on queue depth rather than CPU:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 2
  maxReplicas: 30
  metrics:
    - type: Object
      object:
        describedObject:
          apiVersion: v1
          kind: Service
          name: rabbitmq
        metric:
          name: queue_depth
        target:
          type: AverageValue
          averageValue: "50"

This configuration maintains roughly 50 messages per worker pod. As the queue grows, new workers spawn. As it drains, workers scale down.
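The steady-state replica count implied by a 50-messages-per-pod target can be sketched with an assumed backlog:

```shell
# Rough replica math for the queue-depth target (backlog is illustrative).
queue_depth=620
target_per_pod=50
workers=$(( (queue_depth + target_per_pod - 1) / target_per_pod ))
echo "$workers"   # ceil(620 / 50) = 13
```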

Scaling Pitfalls to Avoid

Cold Start Penalty

If your application takes 30 seconds to start, newly scaled pods cannot serve traffic during that window. Combine HPA with proper readiness probes and consider overprovisioning by setting minReplicas high enough to handle average load without scaling.
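A readiness probe keeps newly scaled pods out of the Service until they can actually serve. A minimal sketch, assuming the application exposes a health endpoint at /healthz on port 8080 (both hypothetical):

```yaml
# Pod template fragment (illustrative values): new pods receive no
# traffic until the probe passes, so a slow-starting replica cannot
# drag down latency during a scale-up event.
readinessProbe:
  httpGet:
    path: /healthz   # assumed health endpoint
    port: 8080       # assumed container port
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
```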

Resource Requests and Limits

HPA CPU percentage is calculated against resources.requests, not limits. If requests are set too low, the HPA sees artificially high utilization and over-scales:

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi

Set requests to match your P50 usage and limits to match P99 usage.
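The utilization math shows why undersized requests cause over-scaling. A sketch with assumed numbers:

```shell
# Utilization is usage divided by requests (values illustrative):
# the same 400m of actual CPU usage reads very differently depending
# on the request it is measured against.
usage_m=400
request_m=250
utilization=$(( 100 * usage_m / request_m ))
echo "${utilization}%"   # 160% against a 250m request; only 80% against 500m
```

Against a 65 percent target, the 250m request makes this pod look severely overloaded, while a 500m request would leave it comfortably inside the tolerance band.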

Cluster Autoscaler Interaction

HPA adds pods, but those pods need nodes to run on. Ensure the Cluster Autoscaler is configured with matching instance types and can provision nodes faster than your scale-up rate. A node that takes 5 minutes to provision defeats the purpose of a 30-second scaling reaction.
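One common way to hide node-provisioning latency is headroom via low-priority placeholder pods, a pattern described in the Cluster Autoscaler documentation. A sketch with illustrative sizing:

```yaml
# Overprovisioning sketch (values illustrative): placeholder pods at
# negative priority reserve node capacity. When real pods scale up,
# the scheduler preempts the placeholders immediately, and the
# Cluster Autoscaler adds a node to reschedule the evicted ones.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-placeholder
spec:
  replicas: 2                # tune to the headroom you want
  selector:
    matchLabels: {app: overprovisioning-placeholder}
  template:
    metadata:
      labels: {app: overprovisioning-placeholder}
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests: {cpu: 500m, memory: 512Mi}
```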

Monitoring HPA Performance

Track these metrics to verify HPA effectiveness:

  • kube_horizontalpodautoscaler_status_current_replicas vs kube_horizontalpodautoscaler_spec_max_replicas (named kube_hpa_* in kube-state-metrics v1) — if current regularly hits max, raise the ceiling
  • kube_horizontalpodautoscaler_status_condition{condition="ScalingLimited",status="true"} — indicates the HPA wants to scale but cannot
  • Request latency P99 during scale-up events — reveals whether scaling reacts fast enough

Use these metrics to iteratively tune stabilization windows and scaling policies for your specific workload patterns. For hands-on help configuring auto-scaling, explore our server optimization services.

Need help with this?

Our team handles this kind of work daily. Let us take care of your infrastructure.