by Emily Dunenfeld
Engineering teams run inference workloads on Kubernetes clusters for easier scaling and job management. But GPUs are among the most expensive resources you can run in the cloud, and in Kubernetes it's easy to provision them and then forget about the cost. Most teams don't have visibility or cost savings measures in place for GPU workloads, so they often overpay without knowing it.
Maintaining GPUs with Kubernetes is easier than provisioning standalone GPU VMs and managing them manually, especially when workloads vary throughout the day or week. It also provides cost visibility and optimization benefits.
For one, you can configure your cluster to run multiple jobs on the same GPU. Instead of dedicating an entire GPU to a single workload, Kubernetes can help you maximize usage across jobs, freeing up capacity and saving money.
It also gives you deeper visibility into GPU usage. With the right tools in place, you can attribute GPU memory consumption to specific jobs, users, or namespaces. Combined with a scheduler that supports autoscaling, Kubernetes can dynamically consolidate workloads and scale GPU nodes up or down as needed.
In contrast, running standalone GPU VMs often leads to over-provisioning and idle capacity, without the insight or flexibility to fix it.
Each of the three main cloud providers has its own GPU offerings for Kubernetes.
Cloud providers differ in GPU model availability, pricing, and regional coverage. However, common challenges such as idle usage, memory inefficiency, and lack of observability persist across platforms.
Cloud pricing is typically per instance-hour, not based on actual GPU usage. You’re billed for the full GPU allocation, regardless of utilization. Billing dimensions usually include: GPU instance size and family, number of GPUs per node, total node uptime, and any attached storage or networking resources.
Pricing below is listed for the US East region.
Charges include the underlying EC2 instances, a version support fee of $0.10 per cluster per hour, and additional per-second charges when EKS Auto Mode is enabled.
AWS EKS current generation EC2 GPU instances
Azure charges for the underlying VMs as well as a per-cluster fee: $0.10 per cluster per hour for the Standard tier and $0.60 per cluster per hour for the Premium tier.
Azure AKS current generation VM GPU instances
GCP charges for the underlying Compute Engine instances and a GKE cluster management fee: $0.10 per cluster per hour for the Standard edition and $0.00822 per vCPU per hour for the Enterprise edition.
You can choose from the following GPU VM types.
GCP GKE current generation VM GPU instances
Alternatively, you can attach GPUs to N1 VMs manually, listed below (costs for the N1 instance also apply, billed at $0.03 per hour).
GCP GKE current generation N1 GPU attachments
As you can see, hourly prices are inherently high, but selecting the right size and running workloads efficiently can make a big difference.
Most GPU waste comes from choosing an instance that’s bigger than necessary.
A big reason is that GPU processing time is hard to measure accurately: metrics can be noisy and inconsistent. Memory usage, on the other hand, is more stable and measurable, and it's often the best proxy for cost efficiency.
You can calculate idle memory by comparing the memory a workload actually uses to the total memory allocated to its GPUs:
idle_memory = total_allocated_memory - used_memory
Then determine the number of GPUs you actually need, which gives a much better idea of which instance type is the best fit.
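As a hypothetical example, suppose a job peaks at 30 GiB of GPU memory but runs on a node with four 24 GiB GPUs (96 GiB allocated in total):

idle_memory = 96 GiB - 30 GiB = 66 GiB

Assuming the workload can spread across GPUs, it would fit on a two-GPU (48 GiB) instance instead, cutting idle memory from 66 GiB to 18 GiB.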
To do this automatically for certain NVIDIA workloads, the Vantage Kubernetes agent integrates with NVIDIA DCGM and automatically calculates GPU idle costs by attributing GPU memory usage per workload. This provides a granular view of how memory is consumed, helping you avoid over-provisioning and improve cost efficiency.
Kubernetes autoscalers can dynamically consolidate workloads and scale GPU nodes up or down as needed. This ensures you’re only paying for the resources you actually need. Proper autoscaling configuration requires careful tuning of thresholds and scaling policies to avoid oscillation while still responding promptly to changing demands.
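As a rough sketch of how this ties together (the names, labels, and image below are hypothetical), a GPU workload only needs to declare its GPU request; with a cluster autoscaler enabled on the GPU node pool, nodes are added when pods like this are pending and removed when they sit empty:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server                # hypothetical workload name
spec:
  replicas: 2                           # scale replicas up or down with demand
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      nodeSelector:
        gpu-pool: "true"                # hypothetical label on your GPU node pool
      containers:
      - name: model
        image: registry.example.com/model-server:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1           # pending pods with GPU requests trigger the autoscaler to add GPU nodes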
While On-Demand pricing gives flexibility, committing to compute usage via Savings Plans can significantly reduce the hourly instance costs for long-running GPU workloads.
For example, the g6.48xlarge Savings Plan rate is $8.69 per hour compared to the $13.35 On-Demand rate, about 35% less.
Running multiple workloads on a single GPU is one of the most effective ways to reduce cost, especially for smaller models or batch inference jobs that don't fully utilize GPU resources. There are a few strategies available, and all three main cloud providers support them as options.
Time slicing provides a simple approach to GPU sharing by allocating GPU time slices to different containers on a time-rotated basis. This works well for workloads that don’t require constant GPU access but still benefit from acceleration.
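As an illustration, here is a minimal sketch of a time-slicing configuration for the NVIDIA Kubernetes device plugin, assuming the plugin is deployed with a custom sharing config; it advertises each physical GPU as four schedulable nvidia.com/gpu resources:

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4                       # each physical GPU is shared by up to 4 pods on a time-rotated basis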
NVIDIA CUDA Multi-Process Service (MPS) allows processes to share the same GPU. In Kubernetes, this is supported through fractional GPU memory requests. Instead of requesting a full GPU, you can request only the memory your workload needs. For example, you can specify a resource request like nvidia.com/gpu-memory: 4Gi to request a subset of memory on a shared GPU.
nvidia.com/gpu-memory: 4Gi
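A minimal pod sketch follows, assuming your GPU sharing solution exposes the nvidia.com/gpu-memory resource shown above (the pod name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: small-inference-job             # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: model
    image: registry.example.com/small-model:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu-memory: 4Gi      # request 4 GiB on a shared GPU instead of a whole device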
NVIDIA Multi-Instance GPU (MIG) is available on certain NVIDIA GPUs, such as the A100 Tensor Core. It provides hardware-level partitioning of a GPU into multiple isolated instances. Each MIG instance has its own memory, cache, and compute cores, providing better isolation than software-based sharing methods.
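For example, with the NVIDIA device plugin configured for MIG, a pod can request a specific MIG profile instead of a full GPU. This sketch assumes an A100 partitioned into 1g.5gb instances (the pod name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: mig-batch-job                   # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: registry.example.com/batch-worker:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1        # one isolated MIG slice (1 compute unit, 5 GB memory) of an A100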
By right-sizing GPU instances, enabling autoscaling, and leveraging sharing strategies like time slicing or MIGs, teams can significantly reduce GPU costs in Kubernetes. These optimizations not only lower spend but also improve resource efficiency.