Beyond Basic CPU Autoscaling
The Horizontal Pod Autoscaler (HPA) is one of Kubernetes' most powerful features, yet most deployments use only the simplest configuration — scale when CPU exceeds 50 percent. In production, CPU-based scaling often reacts too slowly or too aggressively, leading to either request queuing or resource waste.
This guide covers advanced HPA configuration including custom metrics, scaling behaviors, and stabilization windows that we implement as part of our Kubernetes management practice.
How HPA Works Internally
The HPA controller runs a control loop every 15 seconds by default (configurable via the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period). On each iteration it queries the metrics API and calculates the desired replica count using the formula:
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
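The formula above can be sketched as a small Python helper (a hypothetical function for illustration, not part of Kubernetes):

```python
from math import ceil

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """HPA core scaling formula: scale the replica count by the ratio
    of the observed metric value to the target value, rounding up."""
    return ceil(current_replicas * (current_metric / target_metric))

# Example: 4 replicas averaging 90% CPU against a 65% target
print(desired_replicas(4, 90, 65))  # -> 6
```

Note that the ceiling means the controller rounds toward more capacity: even a slight overshoot of the target adds a pod.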
The controller then applies scaling policies, stabilization windows, and tolerance thresholds before issuing a scale command.
The Tolerance Band
By default, HPA has a 10 percent tolerance (--horizontal-pod-autoscaler-tolerance=0.1). If the ratio of current to desired metric value falls within 0.9 to 1.1, no scaling action occurs. This prevents oscillation but can delay response to genuine load changes.
Multi-Metric HPA Configuration
Production services should scale based on multiple signals:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 5
        periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 120
      selectPolicy: Min
Understanding the Behavior Section
Scale-up uses a 30-second stabilization window and two policies: add 50 percent of current replicas or 5 pods, whichever is greater (selectPolicy: Max). This allows aggressive scale-up during traffic spikes.
Scale-down uses a 300-second (5-minute) stabilization window and a conservative policy: remove at most 10 percent of replicas every 2 minutes (selectPolicy: Min). This prevents the cluster from scaling down too quickly after a traffic spike, which would cause problems if traffic returns.
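The scale-up policy selection can be illustrated with a short sketch (a hypothetical helper mirroring the scaleUp policies in the manifest above):

```python
from math import ceil

def scale_up_allowance(current_replicas: int) -> int:
    """Mirror the scaleUp behavior above: per 60-second period, allow
    adding 50% of current replicas or 5 pods, whichever is greater
    (selectPolicy: Max)."""
    percent_policy = ceil(current_replicas * 0.50)  # type: Percent, value: 50
    pods_policy = 5                                 # type: Pods, value: 5
    return max(percent_policy, pods_policy)

print(scale_up_allowance(4))   # small deployment: the Pods policy wins, 5
print(scale_up_allowance(20))  # large deployment: the Percent policy wins, 10
```

At small replica counts the fixed Pods policy dominates, which is exactly why two policies are combined: percentage-only scale-up is too slow when you have few pods.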
Custom Metrics with Prometheus
The built-in CPU and memory metrics are limited. Real application health is better measured by request latency, queue depth, or error rates.
Setting Up the Metrics Pipeline
Install the Prometheus Adapter to expose custom metrics to the HPA:
# prometheus-adapter-config.yaml
rules:
- seriesQuery: 'http_request_duration_seconds_count{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_count$"
    as: "http_requests_per_second"
  metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'
- seriesQuery: 'rabbitmq_queue_messages{namespace!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
  name:
    as: "queue_depth"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>})'
Verifying Custom Metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .
Scaling Based on Queue Depth
For worker deployments that process jobs from a message queue, scale based on queue depth rather than CPU:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 2
  maxReplicas: 30
  metrics:
  - type: Object
    object:
      describedObject:
        apiVersion: v1
        kind: Service
        name: rabbitmq
      metric:
        name: queue_depth
      target:
        type: AverageValue
        averageValue: "50"
This configuration maintains roughly 50 messages per worker pod. As the queue grows, new workers spawn. As it drains, workers scale down.
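The steady state this converges on can be sketched numerically (a hypothetical helper, using the minReplicas/maxReplicas values from the manifest above):

```python
from math import ceil

def workers_for_queue(queue_depth: int, messages_per_worker: int = 50,
                      min_replicas: int = 2, max_replicas: int = 30) -> int:
    """Approximate the worker count the queue-depth HPA converges on:
    queue depth divided by the per-pod target, clamped to min/max."""
    desired = ceil(queue_depth / messages_per_worker)
    return max(min_replicas, min(max_replicas, desired))

print(workers_for_queue(480))   # 480 messages -> 10 workers
print(workers_for_queue(10))    # nearly drained -> clamps to minReplicas, 2
print(workers_for_queue(5000))  # backlog spike -> clamps to maxReplicas, 30
```

The maxReplicas clamp is worth watching: during a sustained backlog the queue keeps growing once the ceiling is reached, so alert on it rather than assuming the HPA will always catch up.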
Scaling Pitfalls to Avoid
Cold Start Penalty
If your application takes 30 seconds to start, newly scaled pods cannot serve traffic during that window. Combine HPA with proper readiness probes and consider overprovisioning by setting minReplicas high enough to handle average load without scaling.
Resource Requests and Limits
HPA CPU percentage is calculated against resources.requests, not limits. If requests are set too low, the HPA sees artificially high utilization and over-scales:
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
Set requests to match your P50 usage and limits to match P99 usage.
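The over-scaling effect of undersized requests is easy to see numerically (a hypothetical helper for illustration):

```python
def observed_utilization(usage_millicores: int, request_millicores: int) -> float:
    """HPA CPU utilization is actual usage divided by the pod's CPU
    request (not its limit), expressed as a percentage."""
    return 100.0 * usage_millicores / request_millicores

# The same 500m of real usage under two different request settings:
print(observed_utilization(500, 250))  # 200.0% -> HPA over-scales
print(observed_utilization(500, 500))  # 100.0% -> reflects real load
```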
Cluster Autoscaler Interaction
HPA adds pods, but those pods need nodes to run on. Ensure the Cluster Autoscaler is configured with matching instance types and can provision nodes faster than your scale-up rate. A node that takes 5 minutes to provision defeats the purpose of a 30-second scaling reaction.
Monitoring HPA Performance
Track these metrics to verify HPA effectiveness:
- kube_hpa_status_current_replicas vs. kube_hpa_spec_max_replicas: if current regularly hits max, increase the ceiling
- kube_hpa_status_condition{condition="ScalingLimited"}: indicates the HPA wants to scale but cannot
- Request latency P99 during scale-up events: reveals whether scaling is fast enough
Use these metrics to iteratively tune stabilization windows and scaling policies for your specific workload patterns. For hands-on help configuring auto-scaling, explore our server optimization services.