Vantage Launches Kubernetes GPU Idle Costs: Calculate Efficiency for AI-Intensive Workloads

by Vantage Team



Today, Vantage announces support for collecting and reporting Kubernetes GPU idle costs. For each Kubernetes pod, customers can now view the idle and total costs of GPU usage within a Kubernetes cluster, which makes underutilized resources easier to identify. GPU memory usage is now available on Kubernetes efficiency reports and is included in the cost efficiency score per pod.

Until now, Kubernetes efficiency reports calculated idle costs using only CPU and RAM. However, for customers with AI workloads, most of the cost of a pod is driven by GPU requests. The Vantage Kubernetes agent already collected these GPU requests and incorporated them in the total cost of the pod, but it did not report how much of each request was idle and did not include these resources in efficiency reports.

Now, the Vantage Kubernetes agent will automatically collect GPU usage information using data from the NVIDIA DCGM Exporter. As a result, Vantage customers can view GPU idle costs, along with CPU and RAM, on Kubernetes efficiency reports. The GPU category is available for filtering and aggregating within a report.

A Kubernetes efficiency report filtered for GPU in the console


This feature is now available for all users with Vantage Kubernetes agent version 1.0.26 or later (available as part of Helm Chart version 1.0.34) installed. Users also need to install the NVIDIA operator on their Kubernetes cluster. Once the operator is installed, the agent will begin to upload the data needed to calculate idle costs. The data will be available on reports within 48 hours as the costs from the infrastructure provider are ingested. If you do not have the Kubernetes agent deployed but would like to see GPU costs, see the Vantage Kubernetes agent documentation. To learn more about Kubernetes efficiency reports, see the Kubernetes efficiency metrics documentation.

Frequently Asked Questions

1. What is being launched today?

Today, Vantage is launching GPU memory idle costs as a part of Kubernetes efficiency reports.

2. Who is the customer?

The customer is any Vantage user who has Kubernetes clusters with NVIDIA GPUs deployed and has the Vantage Kubernetes agent deployed.

3. How much does this cost?

There is no additional cost to collect and report on GPU idle costs.

4. How are GPU idle costs calculated?

When an instance includes GPUs, 95% of the cost of the node is allocated to GPU memory. The number of GPUs requested by the pod dictates how much of the total memory is allocated to the pod. Idle costs are calculated from the used and total memory of the GPUs allocated to the pod, down to the container level (i.e., idle_memory = total_allocated_memory - used_memory).
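To make the arithmetic concrete, here is a minimal Python sketch of the calculation described in this answer. The 95% share and the idle_memory formula come from the text. The even split of GPU cost across the node's GPUs and the conversion of idle memory into idle cost by simple proportion are assumptions made for illustration, and the names gpu_costs and PodGpuUsage are hypothetical, not part of the Vantage agent.

```python
from dataclasses import dataclass
from typing import Tuple

GPU_COST_SHARE = 0.95  # from the FAQ: 95% of node cost is allocated to GPU memory


@dataclass
class PodGpuUsage:
    gpus_requested: int        # whole GPUs requested by the pod
    memory_per_gpu_mib: float  # total memory of each allocated GPU, in MiB
    used_memory_mib: float     # memory in use across the pod's GPUs, in MiB


def gpu_costs(node_cost: float, node_gpu_count: int, pod: PodGpuUsage) -> Tuple[float, float]:
    """Return (total_gpu_cost, idle_gpu_cost) for one pod."""
    if pod.gpus_requested == 0 or node_gpu_count == 0:
        return (0.0, 0.0)

    gpu_pool_cost = node_cost * GPU_COST_SHARE
    # Assumption: the GPU cost pool is split evenly across the node's GPUs.
    total_cost = gpu_pool_cost * pod.gpus_requested / node_gpu_count

    total_allocated_memory = pod.gpus_requested * pod.memory_per_gpu_mib
    idle_memory = total_allocated_memory - pod.used_memory_mib  # formula from the FAQ

    # Assumption: idle cost is proportional to the idle share of allocated memory.
    idle_cost = total_cost * idle_memory / total_allocated_memory
    return (total_cost, idle_cost)


# Example: a $100 node with 4 GPUs; the pod requests 2 GPUs of 16 GiB each
# and uses 8 GiB in total.
total, idle = gpu_costs(100.0, 4, PodGpuUsage(2, 16384.0, 8192.0))
print(f"total={total:.3f} idle={idle:.3f}")
```

Under these assumptions, the example pod is charged 47.50 for its two GPUs, of which 35.625 is idle because only a quarter of the allocated memory is in use.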

5. What needs to be installed on my cluster for the Vantage Kubernetes agent to collect this data?

Usage data is collected via NVIDIA/dcgm-exporter. This is included as part of the NVIDIA/gpu-operator, but it can also be installed independently. The agent scrapes the exporter directly and exposes configuration for the namespace, service name, port name, and path. The default values are configured for the gpu-operator default case.
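As a rough illustration of what scraping the exporter involves, the Python sketch below parses a small sample of Prometheus-style text output and sums used and total GPU memory per pod. DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE are framebuffer memory metrics published by dcgm-exporter, but which metrics the Vantage agent actually reads is not stated here, so treat the choice as an assumption. The sample values and the memory_by_pod helper are invented for the example, and treating used + free as the total ignores any reserved memory.

```python
import re
from collections import defaultdict

# Illustrative sample only; values are in MiB and are made up.
SAMPLE = '''\
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",namespace="ml",pod="train-0",container="trainer"} 4096
DCGM_FI_DEV_FB_FREE{gpu="0",namespace="ml",pod="train-0",container="trainer"} 12288
DCGM_FI_DEV_FB_USED{gpu="1",namespace="ml",pod="train-0",container="trainer"} 0
DCGM_FI_DEV_FB_FREE{gpu="1",namespace="ml",pod="train-0",container="trainer"} 16384
'''

METRIC_LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')
MEMORY_METRICS = ("DCGM_FI_DEV_FB_USED", "DCGM_FI_DEV_FB_FREE")


def memory_by_pod(text):
    """Sum used and total GPU memory (MiB) per (namespace, pod)."""
    totals = defaultdict(lambda: {"used": 0.0, "total": 0.0})
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = METRIC_LINE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        if name not in MEMORY_METRICS:
            continue
        labels = dict(LABEL_PAIR.findall(raw_labels))
        key = (labels.get("namespace", ""), labels.get("pod", ""))
        mib = float(value)
        totals[key]["total"] += mib  # simplification: total = used + free
        if name == "DCGM_FI_DEV_FB_USED":
            totals[key]["used"] += mib
    return dict(totals)


for (namespace, pod), mem in memory_by_pod(SAMPLE).items():
    idle = mem["total"] - mem["used"]
    print(f"{namespace}/{pod}: used={mem['used']} total={mem['total']} idle={idle}")
```

The per-pod used and total figures produced this way are the inputs to the idle_memory formula in question 4.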

6. How can I see the GPU idle costs in Vantage?

To view GPU idle costs, navigate to any Kubernetes efficiency report and set the Category filter to gpu. This filter shows the GPU idle and total costs by cluster. You can add more filters and group the report to see more specific views of the data.

7. Are fractional GPU requests supported?

At this time, the Vantage Kubernetes agent collects only whole GPU requests, which are the most commonly used. If you use fractional GPUs and want those represented, please contact support@vantage.sh.

8. Is GPU power usage considered as part of these calculations?

No, GPU utilization is not factored into efficiency; only GPU memory is tracked. If you have a workload that specifically needs GPU utilization, please contact support@vantage.sh.

9. Which GPU manufacturers are supported?

At this time, only metrics exported by NVIDIA/dcgm-exporter are supported. If you have a different manufacturer you want to see supported, contact support@vantage.sh.

10. Are there Kubernetes rightsizing recommendations for GPU memory?

At this time, GPU efficiency is not included in rightsizing recommendations.

11. Which cloud infrastructure providers are supported for calculating GPU idle costs?

NVIDIA GPU idle costs are available for the agent-supported infrastructure providers: AWS, Azure, and GCP.

12. Are there any GPU-specific settings I need to configure on the agent to begin collecting these metrics?

The agent needs to be installed or upgraded to include the following value: --set agent.gpu.usageMetrics=true. See the documentation for details.

13. Are there any requirements for installing the NVIDIA operator?

See the documentation for details on the steps needed to install the operator.