Slash Your AKS Costs: Run Resilient Production Workloads on Azure Spot VMs
Cloud computing offers incredible scalability, but costs can quickly escalate. For organizations leveraging Azure Kubernetes Service (AKS), one powerful strategy to significantly reduce compute expenses without compromising reliability is by utilizing Azure Spot Virtual Machines. Spot VMs provide access to Azure’s unused compute capacity at substantial discounts compared to pay-as-you-go prices.
However, the journey to production-ready Spot VM adoption isn’t without its challenges. Like many teams, we wanted to move our production services over to Spot VMs as much as possible to cut costs. But the reality of Spot VMs quickly became apparent: they can be evicted, and depending on which machine type you select when creating the node pool, eviction rates can be quite high—many have 20% or higher eviction rates. Even more concerning, you risk having your entire pool evicted at once.
The initial challenge seems straightforward: if your deployment configuration isn’t set up to offload to non-spot pools during spot downtime, you’ll experience actual downtime. But having deployment pods moved to non-spot pools when spot pools aren’t available is just part of the problem. The real complexity emerges when the spot pool comes back online—how do you move the deployment pods back into the spot pool to maintain your cost savings?
And then there’s the cluster autoscaler puzzle: how do you ensure your pods are actually triggering the autoscaler to scale the newly restored spot pool so that everything gets moved back efficiently? These are the exact challenges I’ve struggled with in production environments.
This guide will walk you through the complete solution—from configuring your AKS clusters to harness the cost benefits of Spot VMs, to implementing intelligent failover and failback mechanisms that ensure high availability during spot pool fluctuations. We’ll explore how to combine AKS features, Kubernetes best practices, and tools like the descheduler
to build a resilient system that automatically handles the complex dance between spot and on-demand resources, ensuring you get maximum cost savings without sacrificing reliability.
Azure Spot VMs offer up to 90% cost savings by providing access to Azure’s unused compute capacity. However, they come with the trade-off that Azure can reclaim them when capacity is needed for on-demand workloads.
For a comprehensive understanding of Azure Spot VMs, including workload suitability, setup instructions, and basic configuration patterns, check out our detailed guide: Azure Spot VMs in AKS: Complete Guide to Cost-Effective Kubernetes Workloads.
This article focuses on the advanced challenges of running production workloads on spot instances—specifically the complex problems of automatic failover and failback that most teams encounter when trying to maximize cost savings while maintaining reliability.
While setting up basic spot node pools is straightforward, the real complexity emerges in production scenarios where you need:
- Automatic failover to on-demand capacity when Spot nodes are evicted
- Automatic failback to Spot nodes once capacity becomes available again
- Workload placement that reliably triggers the cluster autoscaler to scale the restored Spot pool
The challenge isn’t just moving workloads off spot nodes during evictions—it’s ensuring they automatically return to spot nodes when capacity becomes available again, and doing so in a way that triggers the cluster autoscaler appropriately.
This guide will walk you through implementing the following components to create a production-ready spot VM architecture:
- aks-node-termination-handler (https://maksim-paskal.github.io/aks-node-termination-handler/) - drains Spot nodes gracefully when Azure issues an eviction notice
- deliveryhero/cluster-overprovisioner (https://charts.deliveryhero.io/) - keeps low-priority placeholder pods running so evicted workloads have immediate headroom to land on
- descheduler/descheduler (https://kubernetes-sigs.github.io/descheduler/) - moves workloads from on-demand nodes back to Spot nodes when capacity returns
- cluster-autoscaler-priority-expander - defines node pool priority for cost optimization

Before implementing advanced Spot VM strategies, ensure you have:
- An AKS cluster with the cluster autoscaler enabled
- At least two node pools: one Spot pool (e.g., worker-spot-pool) and one on-demand pool (e.g., worker-on-demand-pool)

To maximize cost savings, configure the Cluster Autoscaler to prefer Spot node pools over more expensive on-demand pools using the priority expander strategy.
First, update your AKS cluster to use the priority expander:
az aks update -g myResourceGroup -n myAKSCluster --cluster-autoscaler-profile expander=priority
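You can verify the profile took effect with a quick query; the property path below reflects current az CLI output and may vary slightly between CLI versions:
# Should print "priority"
az aks show -g myResourceGroup -n myAKSCluster --query "autoScalerProfile.expander" -o tsv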
Next, create a ConfigMap that defines the priority order. The ConfigMap name must be exactly as shown below:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    20:
      - .*spot.*
    10:
      - .*
This configuration instructs the Cluster Autoscaler to:
- Scale node pools whose names match .*spot.* first (priority 20, the higher value wins)
- Fall back to any other node pool (priority 10) only when Spot capacity is unavailable

The priority expander matches these regexes against the autoscaler's node group (VM scale set) names, so make sure your Spot pools include "spot" in their names. AKS also labels and taints nodes in Spot node pools with kubernetes.azure.com/scalesetpriority: spot, which the affinity rules and tolerations later in this guide rely on. With this setup, your workloads will automatically benefit from Spot pricing when capacity is available, with seamless fallback to on-demand instances when needed.
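You can confirm the label is present on your nodes before relying on it:
# Spot nodes should show "spot" in the SCALESETPRIORITY column
kubectl get nodes -L kubernetes.azure.com/scalesetpriority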
Azure Spot VMs can be evicted with a 30-second notice when Azure needs the capacity back. To handle these evictions gracefully, we need to ensure pods are properly drained from nodes before termination.
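That warning is delivered through Azure's Scheduled Events endpoint in the instance metadata service, which node-level tooling (such as the termination handler discussed below) polls. For illustration, from a shell on any VM in the pool:
# Scheduled Events endpoint of the Azure instance metadata service; Spot evictions appear as "Preempt" events
curl -s -H "Metadata:true" "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"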
Configure appropriate terminationGracePeriodSeconds
in your pod specifications to allow applications time to shut down cleanly:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60  # Allow 60 seconds for graceful shutdown
      containers:
        - name: my-app
          # ... container spec
For production environments, consider implementing the AKS Node Termination Handler to proactively drain nodes when eviction notices are received. This handler monitors Azure’s scheduled events and initiates pod eviction before the node is forcibly terminated.
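A minimal install sketch for the termination handler, assuming the Helm chart is published under the project name at the repository linked above (check the project's documentation for the exact chart name and values):
# Add the chart repository and install the handler into kube-system (chart name assumed)
helm repo add aks-node-termination-handler https://maksim-paskal.github.io/aks-node-termination-handler/
helm install aks-node-termination-handler aks-node-termination-handler/aks-node-termination-handler --namespace kube-system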
Affinity rules are crucial for distributing workloads across failure domains, minimizing the impact of simultaneous Spot VM evictions. Use podAntiAffinity to spread replicas strategically:
- topology.kubernetes.io/zone to distribute pods across different zones
- kubernetes.io/hostname to ensure replicas don't colocate on the same node

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-critical-app
spec:
  replicas: 3
  template:
    spec:
      affinity:
        nodeAffinity:
          # Prefer Spot nodes but allow on-demand as fallback
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: kubernetes.azure.com/scalesetpriority
                    operator: In
                    values:
                      - spot
        podAntiAffinity:
          # Strongly prefer spreading across zones
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - my-critical-app
                topologyKey: topology.kubernetes.io/zone
            # Secondary preference: spread across nodes
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - my-critical-app
                topologyKey: kubernetes.io/hostname
      tolerations:
        # Allow scheduling on Spot nodes
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
Pod Disruption Budgets (PDBs) limit the number of pods that can be simultaneously evicted, preventing service outages during node drains. This is essential when multiple Spot nodes are evicted simultaneously.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-critical-app-pdb
spec:
  minAvailable: 2  # Ensure at least 2 pods remain available
  selector:
    matchLabels:
      app: my-critical-app
Alternative configuration using percentage:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-critical-app-pdb
spec:
  maxUnavailable: 25%  # Allow up to 25% of pods to be unavailable
  selector:
    matchLabels:
      app: my-critical-app
PDBs work with the eviction API to ensure that when nodes are drained (either due to Spot evictions or maintenance), your services maintain minimum availability requirements.
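You can see how much disruption headroom a PDB currently allows at any time:
# ALLOWED DISRUPTIONS shows how many pods could be evicted right now without violating the budget
kubectl get pdb my-critical-app-pdb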
When Spot nodes are evicted, evicted pods need somewhere to land immediately. Cluster overprovisioning creates “headroom” by running low-priority placeholder pods that can be quickly evicted to make room for your critical workloads.
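Under the hood, overprovisioning relies on a negative-priority PriorityClass so the placeholder pods are always the first to be preempted. The Helm chart used below manages this for its placeholder pods; the standalone sketch here (names and values are illustrative) just shows the mechanism:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning-placeholder   # illustrative name
value: -10                 # below the default priority of 0, so these pods are preempted first
globalDefault: false
preemptionPolicy: Never    # placeholder pods must never preempt real workloads
description: "Low-priority placeholder pods that reserve headroom for real workloads."

Placeholder pods reference such a class via spec.priorityClassName and simply request CPU and memory without doing real work.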
Use the cluster-overprovisioner Helm chart:
helm repo add deliveryhero https://charts.deliveryhero.io/
helm install cluster-overprovisioner deliveryhero/cluster-overprovisioner -f overprovisioner-values.yaml
# overprovisioner-values.yaml
fullnameOverride: "overprovision"
deployments:
  - name: spot
    replicaCount: 1
    resources:
      requests:
        cpu: 2        # Reserve 2 CPU cores per placeholder pod
        memory: 4Gi   # Reserve 4Gi of memory per placeholder pod
    nodeSelector:
      kubernetes.azure.com/scalesetpriority: spot
    tolerations:
      - key: "kubernetes.azure.com/scalesetpriority"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
For environments with varying cluster sizes, automatically scale overprovisioning based on the number of Spot nodes:
# Install cluster-proportional-autoscaler
kubectl apply -f https://github.com/kubernetes-sigs/cluster-proportional-autoscaler/releases/latest/download/cluster-proportional-autoscaler.yaml
Configure it to scale overprovisioning pods:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovision-scaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: overprovision-scaler
  template:
    metadata:
      labels:
        app: overprovision-scaler
    spec:
      # Note: the autoscaler needs a service account with RBAC permissions to
      # read nodes and scale the target deployment.
      containers:
        - name: autoscaler
          image: registry.k8s.io/cpa/cluster-proportional-autoscaler:1.8.8
          command:
            - /cluster-proportional-autoscaler
            - --namespace=default
            - --configmap=overprovision-config
            - --target=deployment/overprovision-spot
            - --nodelabels=kubernetes.azure.com/scalesetpriority=spot
            - --logtostderr=true
            - --v=2
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: overprovision-config
  namespace: default
data:
  linear: |-
    {
      "coresPerReplica": 4,
      "nodesPerReplica": 1,
      "preventSinglePointFailure": true,
      "includeUnschedulableNodes": true
    }
In linear mode, the proportional autoscaler sets the replica count to max(ceil(cores / coresPerReplica), ceil(nodes / nodesPerReplica)), counting only nodes that match the --nodelabels selector. With the values above, that works out to roughly one overprovisioning pod per 4 Spot CPU cores (and at least one per Spot node), ensuring proportional headroom as your cluster scales.
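As a worked example of that formula: with four 4-vCPU Spot nodes (16 cores total), the target is max(⌈16/4⌉, ⌈4/1⌉) = 4 placeholder pods; if the Spot pool grows to eight such nodes (32 cores), the target rises to 8.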
While the Cluster Autoscaler Priority Expander ensures new workloads prefer spot nodes, it doesn’t automatically move existing pods from on-demand nodes back to spot nodes when spot capacity becomes available again. This is where the Kubernetes Descheduler becomes essential.
The Descheduler is a Kubernetes component that identifies and evicts pods that are running on suboptimal nodes based on configurable policies. In our spot VM architecture, it serves as the intelligent failback mechanism that moves workloads from expensive on-demand nodes back to cost-effective spot nodes when capacity returns.
When spot nodes become available again after an eviction event, your workloads may still be running on more expensive on-demand nodes. The Descheduler continuously monitors the cluster and identifies pods that violate node affinity preferences—specifically, pods that prefer spot nodes but are currently running on on-demand nodes.
By evicting these pods, the Descheduler forces the Kubernetes scheduler to re-evaluate their placement according to their affinity rules, which will prefer the newly available spot nodes. Combined with cluster overprovisioning, this creates a cascading effect:
- The Descheduler evicts a pod that prefers Spot but is running on an on-demand node
- The scheduler places it on a Spot node, preempting a low-priority overprovisioning pod if necessary
- The displaced overprovisioning pod goes Pending, which triggers the cluster autoscaler to add more Spot capacity
- As on-demand nodes empty out, the autoscaler scales them back down
Install the Descheduler using Helm:
helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm install descheduler descheduler/descheduler --namespace kube-system -f descheduler-values.yaml --set kind=Deployment
Create a values file (descheduler-values.yaml
) that configures the Descheduler to run on spot nodes and target pods violating node affinity preferences:
# descheduler-values.yaml
deschedulerPolicyAPIVersion: "descheduler/v1alpha2"
kind: Deployment

# Allow the descheduler itself to run on spot nodes to save costs
tolerations:
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"

deschedulerPolicy:
  profiles:
    - name: nodeAffinity-profile
      pluginConfig:
        - name: RemovePodsViolatingNodeAffinity
          args:
            nodeAffinityType:
              - preferredDuringSchedulingIgnoredDuringExecution
        - name: DefaultEvictor
          args:
            evictLocalStoragePods: true
      plugins:
        deschedule:
          enabled:
            - RemovePodsViolatingNodeAffinity
Key points in this configuration:
- RemovePodsViolatingNodeAffinity targets pods violating preferredDuringSchedulingIgnoredDuringExecution affinity rules, i.e. pods that would rather be running on Spot nodes
- evictLocalStoragePods: true allows eviction of pods with local storage

You can monitor the Descheduler's effectiveness through its logs and the events it emits:
# Check descheduler logs
kubectl logs -n kube-system deployment/descheduler
# Monitor pod evictions
kubectl get events --field-selector reason=Evicted
Key signals to watch are how many pods the Descheduler evicts per cycle and whether those pods land back on Spot nodes rather than bouncing between on-demand nodes.
The Descheduler works synergistically with cluster overprovisioning, and this integration is absolutely essential for the system to function properly. Here’s why overprovisioning is the final missing piece of the puzzle:
Without Overprovisioning - The Infinite Loop Problem:
The Descheduler evicts a pod from an on-demand node, but there is no spare Spot capacity and nothing goes Pending to trigger a scale-up, so the scheduler simply puts the pod back on an on-demand node. On the next cycle the Descheduler evicts it again, and the loop repeats without ever reclaiming the cost savings.

With Overprovisioning - The Solution:
The evicted pod preempts a low-priority placeholder pod and lands on a Spot node immediately. The displaced placeholder goes Pending, which triggers the cluster autoscaler to add Spot capacity, and the emptied on-demand nodes can then scale back down.
The key insight is that the Descheduler itself doesn’t trigger autoscaling events - it only evicts pods. Without overprovisioning to create the necessary “landing space” and trigger scaling, the Descheduler would be stuck in an endless eviction-reschedule loop, making no actual progress toward cost optimization.
This creates a self-regulating system that automatically optimizes cost distribution while maintaining the safety net of overprovisioning.
Let’s be real—cloud bills can get out of hand fast, especially when you’re running production workloads. But with a little bit of Kubernetes magic and some clever Azure features, you can seriously cut costs without losing sleep over reliability.
Here’s a quick, friendly recap of how all the moving parts come together:
AKS with Cluster Autoscaler
Your cluster grows and shrinks as needed, so you’re not paying for idle machines.
Azure Spot VMs
Super affordable compute power—just remember, Azure can take them back at any time. Great for saving money if you’re ready for a little unpredictability.
Separate Node Pools (Spot + On-Demand)
Keep your spot and regular nodes in their own groups. This way, you can control where your apps run and always have a backup plan.
Cluster Autoscaler Priority Expander
Tells Kubernetes to always try the cheap spot nodes first, and only use the pricier on-demand nodes if it has to.
Node Affinity & Anti-Affinity
Spreads your apps out so a single spot node going down doesn’t take everything with it.
Pod Disruption Budgets (PDBs)
Makes sure not too many pods get evicted at once, so your service stays up.
Cluster Overprovisioning
Runs “placeholder” pods to keep some space free. When spot nodes vanish, your real apps can jump in right away.
Cluster Proportional Autoscaler
Adjusts the number of placeholder pods as your cluster changes size, so you always have just enough wiggle room.
Kubernetes Descheduler
When spot nodes come back, this little helper moves your apps back to the cheaper nodes—so you keep saving.
AKS Node Termination Handler
Gives your apps a heads-up before a spot node is evicted, so they can shut down gracefully (no more “surprise, you’re gone!”).
Automatic Failover:
Spot nodes disappear? No problem—your apps move to on-demand nodes, all on their own.
Smart Failback:
Spot nodes return? Your apps slide back over, and your wallet thanks you.
Always Chasing the Best Price:
The whole setup is like a self-driving car for your cloud costs—always steering you toward the cheapest, safest route.
With this setup, you get the best of both worlds: big savings from spot VMs and the peace of mind that your production workloads are safe. It takes a bit of tinkering to get right, but once it’s humming, your cluster will handle the hard work for you. More savings, less stress—what’s not to love?
Happy hacking, and may your cloud bills always be tiny!