Engineering teams run inference workloads on Kubernetes clusters for easier scaling and job management. But GPUs are among the most expensive resources you can run in the cloud, and once they’re set up in Kubernetes, it’s easy to forget about their cost. Most teams have neither visibility into their GPU workloads nor cost-saving measures in place, so they often overpay without knowing it.

Kubernetes GPU Cost Benefits

Running GPUs on Kubernetes is easier than provisioning standalone GPU VMs and managing them by hand, especially when workloads vary throughout the day or week. It also brings cost visibility and optimization benefits.

For one, you can configure your cluster to run multiple jobs on the same GPU. Instead of dedicating an entire GPU to a single workload, Kubernetes can help you maximize utilization across jobs, saving both capacity and money (see the sharing strategies later in this article).

It also gives you deeper visibility into GPU usage. With the right tools in place, you can attribute GPU memory consumption to specific jobs, users, or namespaces. Combined with autoscaling, Kubernetes can dynamically consolidate workloads and scale GPU nodes up or down as needed.

In contrast, running standalone GPU VMs often leads to over-provisioning and idle capacity, without the insight or flexibility to fix it.

Kubernetes GPU Pricing for AWS, Azure, and GCP

Each of the three main cloud providers has its own GPU offerings for Kubernetes.

Cloud providers differ in GPU model availability, pricing, and regional coverage. However, common challenges such as idle usage, memory inefficiency, and lack of observability persist across platforms.

Cloud pricing is typically per instance-hour, not based on actual GPU usage. You’re billed for the full GPU allocation, regardless of utilization. Billing dimensions usually include: GPU instance size and family, number of GPUs per node, total node uptime, and any attached storage or networking resources.

All prices below are for the US East region.

AWS EKS GPUs

Charges include the underlying EC2 instances, a version support fee of $0.10 per cluster per hour, and additional per-second EKS Auto Mode charges when enabled.

EC2 GPU Instance On-Demand Hourly Cost vCPU Memory (GiB) GPU Memory (GiB)
g4dn.xlarge $0.53 4 16 16
g4dn.2xlarge $0.75 8 32 16
g4dn.4xlarge $1.20 16 64 16
g4dn.8xlarge $2.18 32 128 16
g4dn.12xlarge $3.91 48 192 64
g4dn.16xlarge $4.35 64 256 16
g4dn.metal $7.82 96 384 128
g5.xlarge $1.01 4 16 24
g5.2xlarge $1.21 8 32 24
g5.4xlarge $1.62 16 64 24
g5.8xlarge $2.45 32 128 24
g5.12xlarge $5.67 48 192 96
g5.16xlarge $4.10 64 256 24
g5.24xlarge $8.14 96 384 96
g5.48xlarge $16.29 192 768 192
g5g.xlarge $0.42 4 8 16
g5g.2xlarge $0.56 8 16 16
g5g.4xlarge $0.83 16 32 16
g5g.8xlarge $1.37 32 64 16
g5g.16xlarge $2.74 64 128 32
g5g.metal $2.74 64 128 32
g6.xlarge $0.80 4 16 24
g6.2xlarge $0.98 8 32 24
g6.4xlarge $1.32 16 64 24
g6.8xlarge $2.01 32 128 24
g6.12xlarge $4.60 48 192 96
g6.16xlarge $3.40 64 256 24
g6.24xlarge $6.68 96 384 96
g6.48xlarge $13.35 192 768 192
g6e.xlarge $1.86 4 32 48
g6e.2xlarge $2.24 8 64 48
g6e.4xlarge $3.00 16 128 48
g6e.8xlarge $4.53 32 256 48
g6e.12xlarge $10.49 48 384 192
g6e.16xlarge $7.58 64 512 48
g6e.24xlarge $15.07 96 768 192
g6e.48xlarge $30.13 192 1536 384
gr6.4xlarge $1.54 16 128 24
gr6.8xlarge $2.45 32 256 24
p4d.24xlarge $32.77 96 1152 320
p5.48xlarge $98.32 192 2048 640


AWS EKS current generation EC2 GPU instances.

Azure AKS GPUs

Azure charges for the underlying VMs as well as a per-cluster management fee: $0.10 per cluster per hour for the Standard tier and $0.60 per cluster per hour for the Premium tier.

Azure GPU VM On-Demand Hourly Cost vCPU Memory (GB)
NC40ads H100 v5 $6.98 40 320
NC80adis H100 v5 $13.96 80 640
NCC40ads H100 v5 $6.98 40 320
NC6s v3 $3.06 6 112
NC12s v3 $6.12 12 224
NC24s v3 $12.24 24 448
NC24rs v3 $13.46 24 448
NC4as T4 v3 $0.53 4 28
NC8as T4 v3 $0.75 8 56
NC16as T4 v3 $1.20 16 110
NC64as T4 v3 $4.35 64 440
NC24ads A100 v4 $3.67 24 220
NC48ads A100 v4 $7.35 48 440
NC96ads A100 v4 $14.69 96 880
ND96isr MI300X v5 $48.00 96 1850
ND96isr H100 v5 $98.32 96 1900
ND96amsr A100 v4 $32.77 96 1900
ND96asr A100 v4 $27.20 96 900
NG8ads V620 v1 $0.64 8 16
NG16ads V620 v1 $1.27 16 32
NG32ads V620 v1 $2.54 32 64
NG32adms V620 v1 $3.30 32 176
NV12s v3 $1.14 12 112
NV24s v3 $2.28 24 224
NV48s v3 $4.56 48 448
NV4as v4 Currently Unavailable 4 14
NV8as v4 Currently Unavailable 8 28
NV16as v4 Currently Unavailable 16 56
NV32as v4 Currently Unavailable 32 112
NV6ads A10 v5 $0.45 6 55
NV12ads A10 v5 $0.91 12 110
NV18ads A10 v5 $1.60 18 220
NV36ads A10 v5 $3.20 36 440
NV36adms A10 v5 $4.52 36 880
NV72ads A10 v5 $6.52 72 880


Azure AKS current generation VM GPU instances.

GCP GKE GPUs

GCP charges for the underlying Compute Engine instances and a GKE cluster management fee: $0.10 per cluster per hour for the Standard edition and $0.00822 per vCPU per hour for the Enterprise edition.

You can choose from the following GPU VM types.

VM GPU Memory On-Demand Hourly Cost
a4-highgpu-8g 1440 GB HBM3e Currently Unavailable
a3-ultragpu-8g 1128 GB HBM3e $84.80690849
a2-ultragpu-1g 80 GB HBM3 $5.06879789
a2-ultragpu-2g 160 GB HBM3 $10.13759578
a2-ultragpu-4g 320 GB HBM3 $20.27519156
a2-ultragpu-8g 640 GB HBM3 $40.55038312
g2-standard-4 24 GB GDDR6 $0.70683228
g2-standard-8 24 GB GDDR6 $0.85362431
g2-standard-12 24 GB GDDR6 $1.00041635
g2-standard-16 24 GB GDDR6 $1.14720838
g2-standard-24 48 GB GDDR6 $2.0008327
g2-standard-32 24 GB GDDR6 $1.73437653
g2-standard-48 96 GB GDDR6 $4.00166539
g2-standard-96 192 GB GDDR6 $8.00333078


GCP GKE current generation VM GPU instances.

Alternatively, you can attach GPUs to N1 VMs manually, as listed below (the N1 instance itself is billed separately, at $0.03 per hour).

GPU Attachment GPU Memory On-Demand Hourly Cost
NVIDIA T4 Virtual Workstation - 1 GPU 16 GB GDDR6 $0.55
NVIDIA T4 Virtual Workstation - 2 GPUs 32 GB GDDR6 $1.10
NVIDIA T4 Virtual Workstation - 4 GPUs 64 GB GDDR6 $2.20
NVIDIA P4 Virtual Workstation - 1 GPU 8 GB GDDR5 $0.80
NVIDIA P4 Virtual Workstation - 2 GPUs 16 GB GDDR5 $1.60
NVIDIA P4 Virtual Workstation - 4 GPUs 32 GB GDDR5 $3.20
NVIDIA P100 Virtual Workstation - 1 GPU 16 GB HBM2 $1.66
NVIDIA P100 Virtual Workstation - 2 GPUs 32 GB HBM2 $3.32
NVIDIA P100 Virtual Workstation - 4 GPUs 64 GB HBM2 $6.64


GCP GKE current generation N1 GPU attachments.

Optimizing GPU Costs in Kubernetes: Right-Sizing, Autoscaling, and Sharing Strategies

As the pricing tables show, hourly GPU rates are steep, but selecting the right size and running workloads efficiently can make a big difference.

Select the Right GPU Instance Size

Most GPU waste comes from choosing an instance that’s bigger than necessary.

A big reason is that GPU processing time is hard to measure accurately. Utilization metrics can be noisy and inconsistent. Memory usage, on the other hand, is more stable and measurable, and it’s often the best proxy for cost efficiency.

You can estimate idle memory by comparing the memory a workload actually uses to the total memory allocated to it on the GPU:

idle_memory = total_allocated_memory - used_memory

Then, determine the number of GPUs your workloads actually need; that gives a much better idea of which instance type is the best fit.
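
Extending the formula above (a rough sketch in the same notation; total_required_memory sums peak GPU memory across your workloads, and ceil rounds up):

gpus_needed = ceil(total_required_memory / memory_per_gpu)

For example, workloads that together peak at 30 GiB of GPU memory fit on two 24 GiB L4 GPUs (ceil(30 / 24) = 2), which points to a two-GPU instance rather than a larger, mostly idle one.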

To do this automatically for supported NVIDIA workloads, the Vantage Kubernetes agent integrates with NVIDIA DCGM and calculates GPU idle costs by attributing GPU memory usage to each workload. This granular view of memory consumption helps you avoid over-provisioning and improve cost efficiency.

Autoscaling

Kubernetes autoscalers can dynamically consolidate workloads and scale GPU nodes up or down as needed. This ensures you’re only paying for the resources you actually need. Proper autoscaling configuration requires careful tuning of thresholds and scaling policies to avoid oscillation while still responding promptly to changing demands.
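
As a minimal sketch, the Deployment below requests one GPU per replica. With Cluster Autoscaler or Karpenter watching the cluster, replicas that can’t be scheduled trigger a new GPU node, and scaling the Deployment down lets the autoscaler remove it. The name, image, and taint key are illustrative, though nvidia.com/gpu is the standard device plugin resource:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference                    # hypothetical workload name
spec:
  replicas: 2                        # scale this up or down; the autoscaler follows
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: server
        image: registry.example.com/model-server:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1        # one full GPU per replica
      tolerations:
      - key: nvidia.com/gpu          # GPU node pools are commonly tainted like this
        operator: Exists
        effect: NoSchedule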

Commit and Save

While On-Demand pricing gives flexibility, committing to compute usage via Savings Plans can significantly reduce the hourly instance costs for long-running GPU workloads.

For example, the g6.48xlarge Savings Plan rate is $8.69 per hour compared to the $13.35 On-Demand rate, 35% less.

GPU Sharing

Running multiple workloads on a single GPU is one of the most effective ways to reduce cost, especially for smaller models or batch inference jobs that don’t fully utilize GPU resources. A few strategies are available, each supported and documented by the three main cloud providers.

Time Slicing/Sharing

Time slicing provides a simple approach to GPU sharing by allocating GPU time slices to different containers on a time-rotated basis. This works well for workloads that don’t require constant GPU access but still benefit from acceleration.
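
With the NVIDIA Kubernetes device plugin, for example, time slicing is enabled through the plugin’s configuration file. A minimal sketch that advertises each physical GPU as four schedulable replicas (the replica count is illustrative):

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu    # each physical GPU is advertised as 4 resources
      replicas: 4

Each slice is then requested as a normal nvidia.com/gpu resource. Keep in mind that time slicing provides no memory isolation between the workloads sharing a GPU.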


NVIDIA MPS

NVIDIA CUDA Multi-Process Service (MPS) allows multiple processes to share the same GPU concurrently. In Kubernetes, this is supported through fractional GPU memory requests. Instead of requesting a full GPU, you can request only the memory your workload needs. For example, you can specify a resource request like nvidia.com/gpu-memory: 4Gi to request a subset of memory on a shared GPU.
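
A minimal pod sketch using that style of request (the pod name and image are hypothetical, and the exact resource name depends on how your device plugin exposes MPS sharing):

apiVersion: v1
kind: Pod
metadata:
  name: small-inference                        # hypothetical name
spec:
  containers:
  - name: model
    image: registry.example.com/model:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu-memory: 4Gi             # 4 GiB of a shared GPU, as above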

Multi-Instance GPUs

NVIDIA Multi-Instance GPU (MIG) is available on certain NVIDIA GPUs, such as the A100 Tensor Core GPU. It provides hardware-level partitioning of a GPU into multiple isolated instances. Each MIG instance has its own memory, cache, and compute cores, providing better isolation than software-based sharing methods.
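
With the NVIDIA device plugin’s mixed MIG strategy, each MIG profile is exposed as its own named resource. A minimal pod sketch requesting a 1g.5gb slice of an A100 (the pod name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: mig-inference                          # hypothetical name
spec:
  containers:
  - name: model
    image: registry.example.com/model:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1               # one hardware-isolated 1g.5gb slice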

Conclusion

By right-sizing GPU instances, enabling autoscaling, and leveraging sharing strategies like time slicing or MIGs, teams can significantly reduce GPU costs in Kubernetes. These optimizations not only lower spend but also improve resource efficiency.