Types of pod autoscaling

Autoscaling can occur in two dimensions:

  1. Horizontal Pod Autoscaling (HPA): Scales the number of pod replicas in response to workload demands.
  2. Vertical Pod Autoscaling (VPA): Adjusts the CPU and memory resources allocated to individual pods to match their needs.

Horizontal Pod Autoscaling (HPA)

HPA is a fundamental feature in Kubernetes that automatically scales the number of pod replicas based on observed metrics such as CPU utilization, memory usage, or even custom metrics. It is particularly useful in applications with varying levels of traffic or computational demand, allowing for dynamic scaling in and out.

How HPA Works

HPA operates by continuously monitoring the metrics of running pods and adjusting the number of replicas to match the desired utilization target. The HPA controller in Kubernetes compares the current usage against the target metrics defined by the user and calculates the required number of replicas.

Here’s a detailed step-by-step process of how HPA functions:

  1. Metrics Collection: Kubernetes gathers resource usage data from the pods via the Metrics Server or a custom metrics pipeline (such as Prometheus).
  2. Target Setting: You define a target utilization (e.g., 70% CPU usage) in your HPA configuration. The HPA controller uses this target to determine whether to scale the number of pods up or down.
  3. Calculation of Desired Replicas: The HPA controller uses the formula: Desired Replicas = ceil(Current Replicas × Current Utilization / Target Utilization). For example, if average CPU utilization is 140% against a target of 70% and 5 replicas are running, the desired number of replicas is ceil(5 × 140/70) = 10.
  4. Scaling Decision: Based on this calculation, the HPA controller adjusts the number of replicas in the deployment. Kubernetes then either creates or deletes pods to achieve the desired state.
  5. Stabilization: After scaling, the HPA applies a stabilization window (often described as a cool-down period) before scaling again, to prevent rapid flapping in response to temporary spikes in demand. This avoids unnecessary churn and keeps replica counts stable.
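The calculation in steps 3–4 can be sketched in a few lines. This is a simplified model, not the controller's actual code: the real algorithm also handles unready pods, missing metrics, and per-direction stabilization, and the 10% tolerance band used here is the controller's default, shown as an explicit parameter.

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, tolerance: float = 0.1) -> int:
    """Simplified sketch of the HPA replica calculation.

    Utilizations are averages across pods (e.g. 140 and 70 for percent CPU).
    """
    ratio = current_utilization / target_utilization
    # Within the tolerance band, the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Round up so capacity never dips below the target.
    return math.ceil(current_replicas * ratio)

# 140% average CPU against a 70% target doubles 5 replicas to 10.
print(desired_replicas(5, 140, 70))   # 10
# 35% average CPU against the same target halves 10 replicas to 5.
print(desired_replicas(10, 35, 70))   # 5
```

Note the ceiling in the scale-up direction: rounding down would leave the workload running above its target utilization.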

HPA Configuration Example

To configure HPA, you can use the kubectl autoscale command or define an HPA resource in a YAML file. Here’s an example of configuring HPA for a deployment:


apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

In this example, the HPA scales the my-app deployment between 3 and 10 replicas, adding pods when average CPU utilization rises above the 70% target and removing them when it falls below.
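The same autoscaler can also be created imperatively with the kubectl autoscale command mentioned above; a rough equivalent (note that kubectl names the HPA object after the deployment rather than my-app-hpa):

```shell
# Create an HPA for the my-app deployment: 3-10 replicas, 70% CPU target.
kubectl autoscale deployment my-app \
  --namespace default \
  --min=3 --max=10 \
  --cpu-percent=70

# Inspect the resulting autoscaler and its current metrics.
kubectl get hpa --namespace default
```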

Advanced HPA with Custom Metrics

While HPA traditionally uses CPU and memory metrics, Kubernetes allows for more advanced scaling scenarios using custom metrics. This enables scaling based on application-specific metrics such as request latency, queue length, or even external metrics from monitoring systems like Prometheus.

To use custom metrics with HPA, you need to define the metrics in a custom metrics API and configure your HPA to use these metrics. Here’s a brief overview of how it works:

  1. Deploy Metrics Adapter: Install a metrics adapter like Prometheus Adapter to expose custom metrics to the Kubernetes API.
  2. Define Custom Metrics: Use your application’s instrumentation (e.g., Prometheus) to collect custom metrics that are relevant for scaling.
  3. Configure HPA for Custom Metrics: Modify your HPA configuration to use the custom metrics, specifying the metric name and target value.
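As a concrete sketch of steps 1–2, a Prometheus Adapter rule can turn a histogram exported by the application into a per-pod average-duration metric. The metric and label names here are assumptions chosen to match the HPA example that follows; the structure (seriesQuery, resources, name, metricsQuery) follows the adapter's rule configuration format:

```yaml
# Prometheus Adapter rule config (sketch): expose average request duration
# per pod as the custom metric "request_duration_seconds".
rules:
- seriesQuery: 'request_duration_seconds_sum{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    as: "request_duration_seconds"
  # Average duration = rate of summed durations / rate of request count.
  metricsQuery: >-
    sum(rate(request_duration_seconds_sum{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
    /
    sum(rate(request_duration_seconds_count{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
```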

Example:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: custom-metric-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-custom-app
  minReplicas: 2
  maxReplicas: 15
  metrics:
  - type: Pods
    pods:
      metric:
        name: request_duration_seconds
      target:
        type: AverageValue
        averageValue: 300m # Kubernetes quantity: 300 milli-units = 0.3 s

This configuration scales the my-custom-app deployment based on the request_duration_seconds custom metric, adding replicas when the average request duration per pod rises above 0.3 seconds, so that response times remain within acceptable limits.

Best Practices for HPA

  • Start with Conservative Limits: Begin with conservative scaling limits to avoid rapid scale-ups or downs that could destabilize your application.
  • Monitor Scaling Events: Regularly monitor HPA activity to understand scaling behavior and adjust targets or thresholds as needed.
  • Use Custom Metrics Wisely: Only use custom metrics that are critical to your application’s performance, and ensure that the metrics are stable and reliable.

Vertical Pod Autoscaling (VPA)

Vertical Pod Autoscaling (VPA) is designed to automatically adjust the CPU and memory resources allocated to individual pods, optimizing resource utilization for workloads with varying resource demands. Instead of changing the number of pods, VPA modifies the resources within the existing pods, making it ideal for applications where the workload remains constant but resource requirements fluctuate.

How VPA Works

VPA continuously monitors the resource utilization of pods and compares it with the resource requests and limits defined in the pod specifications. Based on this comparison, VPA recommends or automatically adjusts the resources for each pod.

Here’s how VPA operates:

  1. Monitoring Resource Usage: VPA collects real-time data on CPU and memory usage from the pods, using the Kubernetes Metrics Server or custom metrics pipelines.
  2. Recommendation Engine: The VPA recommendation engine analyzes this data and suggests adjustments to the pod’s resource requests. These recommendations aim to balance resource usage with performance, avoiding over-provisioning or under-provisioning.
  3. Resource Adjustment: Depending on the VPA update mode, the controller either only publishes recommendations (which administrators can review and apply) or applies them automatically.
  4. Resource Update: In automatic modes, the VPA evicts pods whose requests have drifted too far from the recommendation and rewrites the resource requests as replacement pods are admitted; a pod restart is therefore required to apply new requests.
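The recommendation step can be sketched in miniature: take a high percentile of observed usage and pad it with a safety margin. The real VPA recommender weights samples in decaying histograms over long windows; the percentile and margin below are illustrative assumptions, not its actual defaults.

```python
def recommend_request(usage_samples, percentile=0.9, safety_margin=0.15):
    """Toy VPA-style recommendation for one resource.

    usage_samples: observed usage values (e.g. CPU in millicores).
    Returns a suggested resource request: a high percentile of the
    observed samples, padded with a safety margin.
    """
    samples = sorted(usage_samples)
    idx = min(len(samples) - 1, int(percentile * len(samples)))
    return samples[idx] * (1 + safety_margin)

# Ten CPU samples in millicores; the 90th-percentile sample is 240m,
# padded by 15% to a ~276m request.
cpu_millicores = [110, 120, 125, 130, 140, 150, 160, 180, 200, 240]
print(round(recommend_request(cpu_millicores)))   # 276
```

The padding above the percentile is what keeps the pod from being throttled or OOM-killed during short bursts that the percentile alone would miss.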

VPA Configuration Example

Configuring VPA involves creating a VPA resource that specifies how the resources should be adjusted for a given deployment or stateful set. Here’s an example of a basic VPA configuration:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"

In this example, the VPA will automatically adjust the CPU and memory resources for pods in the my-app deployment.
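With no further constraints, VPA is free to pick any request values. In practice you can bound its choices with a resourcePolicy; the bounds below are illustrative assumptions, not recommended values:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"   # apply to every container in the pod
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "2"
        memory: 2Gi
```

Bounding the recommendations keeps a misbehaving workload from being granted ever-larger requests, and keeps bursty ones from being squeezed below a safe floor.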

Best Practices for VPA

  • Gradual Changes: Apply changes gradually to avoid drastic shifts in resource allocation that could disrupt the application’s performance.
  • Monitor Recommendations: Regularly review VPA recommendations, especially in the initial stages, to ensure that they align with expected performance.
  • Combine with HPA: For applications with both fluctuating traffic and varying resource demands, consider combining HPA and VPA to achieve both horizontal and vertical scaling.

Use cases for HPA and VPA

When deciding between HPA and VPA, consider the nature of your application’s workload:

  • Use HPA: If your application experiences variable traffic patterns that require scaling in the number of instances, HPA is the appropriate choice. It’s ideal for front-end services, APIs, and other stateless applications that need to scale out to handle increased traffic.
  • Use VPA: If your application has stable traffic but varying resource requirements (e.g., batch processing jobs, machine learning workloads), VPA is more suitable. It allows each pod to efficiently use its allocated resources, minimizing waste.
  • Combine HPA and VPA: In some cases, the best approach is to use both HPA and VPA together. This hybrid model allows you to scale out when traffic increases while also optimizing resource usage within each pod.

Further resources

Kubernetes – Horizontal Pod Autoscaling

GitHub – Kubernetes Autoscaling