Azure Virtual Machines (VMs) consistently rank as the largest contributor to Azure cloud spend, accounting for over 20% of our customers’ costs. While several methods exist for optimizing costs, we’ve observed from anonymized customer data that Azure users, particularly when compared to their AWS EC2 counterparts, are not fully leveraging available cost optimization strategies. Notably, there’s a significant opportunity to utilize Reserved VM Instances or savings plans for compute. These pricing options, along with other cost optimization techniques we’ll explore, can provide substantial savings on your Azure costs.

Azure Pricing Recap

Azure offers multiple pricing options. The most popular is pay-as-you-go pricing, which is charged per second and on demand, meaning you can start and stop your VMs at your discretion. The others, which we will discuss later, are the Azure savings plan for compute, Reserved VM Instances, and Spot VMs. They offer huge opportunities for savings, with discounted pricing in exchange for commitments or the tradeoff of interruptibility.

VMs are priced according to factors like vCPU count, RAM, and temporary storage, and costs also vary by operating system/software and region. In addition, the VM’s category and family play a huge part in determining its overall cost and performance characteristics.

Selecting the Correct Azure VM

Choosing the right VM is an important step in optimizing both performance and cost. There are hundreds of VMs to choose from, each with unique specifications tailored to different workload requirements. These specifications span various processors, features, and performance characteristics, allowing for fine-tuned selection based on specific needs.

To streamline the selection process, Azure organizes its VMs into six categories, each containing its own families. Selecting a family that closely meets your use case is a sure-fire way to ensure you’re getting the compute you need without over-provisioning and overpaying.

Category: General Purpose
Azure recommended use cases: Testing and development, small to medium databases, and low to medium traffic web servers
Current generation families (Azure recommended workloads): A (entry-level economical); B (burstable); D (enterprise-grade applications, relational databases, in-memory caching, data analytics); DC (D-family with confidential computing)

Category: Compute Optimized
Azure recommended use cases: Medium traffic web servers, network appliances, batch processes, and application servers
Current generation families (Azure recommended workloads): F (medium traffic web servers, network appliances, batch processes, application servers); FX (electronic design automation, large memory relational databases, medium to large caches, in-memory analytics)

Category: Memory Optimized
Azure recommended use cases: Relational database servers, medium to large caches, and in-memory analytics
Current generation families (Azure recommended workloads): E (relational databases, medium to large caches, in-memory analytics); Eb (E-family with high remote storage performance); EC (E-family with confidential computing); M (extremely large databases, large amounts of memory)

Category: Storage Optimized
Azure recommended use cases: Big Data, SQL, NoSQL databases, data warehousing, and large transactional databases
Current generation families (Azure recommended workloads): L (high disk throughput and IO, big data, SQL and NoSQL databases, data warehousing, large transactional databases)

Category: GPU Accelerated
Azure recommended use cases: Compute-intensive, graphics-intensive, and visualization workloads
Current generation families (Azure recommended workloads): NC (compute-intensive, graphics-intensive, visualization); ND (large memory compute-intensive, large memory graphics-intensive, large memory visualization); NG (virtual desktop, cloud gaming); NV (virtual desktop, single-precision compute, video encoding and rendering)

Category: High Performance
Azure recommended use cases: HPC workloads
Current generation families (Azure recommended workloads): HB (high memory bandwidth, fluid dynamics, weather modeling); HC (high density compute, finite element analysis, molecular dynamics, computational chemistry); HX (large memory capacity, electronic design automation)

Azure VM categories (source)

Once you’ve selected an appropriate VM family, Azure offers multiple VM sizes within that family with varying CPU, memory, storage, and networking capabilities. Later, in the right sizing section, we’ll go over choosing a size that closely matches your workload requirements, accounting for spikes in CPU, to avoid over-provisioning.
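If you want to compare the sizes within a family programmatically, the Azure SDK for Python can list what’s offered in a region. The following is a minimal sketch, assuming the azure-identity and azure-mgmt-compute packages and placeholder values for the subscription ID and region; it prints D-family sizes with their vCPU, memory, and temporary disk specs.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<your-subscription-id>"  # assumed placeholder
LOCATION = "eastus"                         # assumed placeholder region

credential = DefaultAzureCredential()
compute_client = ComputeManagementClient(credential, SUBSCRIPTION_ID)

# List every VM size offered in the region, then filter to the D family
# so sizes can be compared on vCPUs and memory before picking one.
for size in compute_client.virtual_machine_sizes.list(LOCATION):
    if size.name.startswith("Standard_D"):
        print(f"{size.name}: {size.number_of_cores} vCPUs, "
              f"{size.memory_in_mb / 1024:.0f} GiB RAM, "
              f"{size.resource_disk_size_in_mb / 1024:.0f} GiB temp disk")
```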

Burstable VMs

The Burstable B-series VMs offer a unique and cost-effective solution for workloads with variable performance needs, such as development and test environments, low-traffic web servers, small databases, and proof-of-concept deployments. They provide a baseline level of CPU performance (typically 5-40% of the VM’s full capacity) with the ability to burst to higher levels when required. They work on a credit-based system: CPU credits accumulate while the VM runs below its defined baseline performance level, and those credits are then spent to burst above the baseline when workload demands increase, allowing the VM to use up to 100% of its CPU capacity for short periods.
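To make the credit mechanics concrete, here is a deliberately simplified model in Python. It is not Azure’s exact accounting (real B-series sizes have published credit rates and a cap on banked credits); it just assumes a hypothetical VM with a 20% baseline and shows credits building up during quiet minutes and draining during a burst.

```python
# Simplified illustration of B-series credit banking (assumed model, not
# Azure's exact formula): each minute the VM banks or spends credits in
# proportion to how far its CPU usage sits below or above the baseline.
BASELINE_PCT = 20  # assumed baseline for this hypothetical VM size

def simulate(minute_by_minute_cpu_pct, starting_credits=0.0):
    credits = starting_credits
    for minute, cpu in enumerate(minute_by_minute_cpu_pct, start=1):
        credits += (BASELINE_PCT - cpu) / 100  # bank below baseline, spend above
        print(f"minute {minute:3d}: cpu={cpu:3d}%  credits={credits:6.2f}")
    return credits

# 60 quiet minutes at 5% CPU, then a 10-minute burst at 100% CPU.
workload = [5] * 60 + [100] * 10
simulate(workload)
```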

Save on Azure VMs With Pricing Options

While pay-as-you-go plans are ideal for use cases like unpredictable workloads, short-term projects, or when you’re starting in the cloud and are unsure of your needs, they are typically the most expensive. For stable, long-running workloads, you can save by committing to a 1- or 3-year term. Alternatively, for interruptible workloads, you can save using Spot instances. Organizations often use a combination of pricing options for different VMs.

Azure Virtual Machine On-Demand Distribution

Graph of pay-as-you-go vs committed use indicating Azure VM users could save by switching to Reserved VM Instances or savings plans for compute

Reserved VM Instances

Reserved VM Instances offer significant cost savings, up to 72% compared to pay-as-you-go pricing. They’re ideal for steady-state workloads with predictable resource needs, such as production environments, long-running applications, or continuous dev/test setups. They require a 1-year or 3-year commitment to a specific VM size and region. This commitment allows for deep discounts but reduces flexibility: if your needs change, you may end up with underutilized reservations.

Azure Savings Plans for Compute

Savings plans for compute provide a more flexible option for long-term savings, offering up to 65% off pay-as-you-go rates. Like RIs, they require a 1-year or 3-year commitment, but the commitment is to a specific amount of compute usage per hour rather than a specific VM size. This flexibility allows you to change VM sizes or even switch between VMs, containers, and other eligible services without losing your discount. The drawback is the slightly lower maximum savings compared to RIs, but for many use cases, the added flexibility outweighs this difference.

Spot VMs

Spot tends to offer the deepest discounts, up to 91% off pay-as-you-go prices, by allowing you to rent unused Azure capacity. It’s ideal for interruptible workloads such as batch processing jobs, rendering tasks, or certain types of development and testing. The tradeoff is that Spot VMs can be interrupted with minimal notice when Azure needs the capacity back. This makes them unsuitable for many workloads, such as critical or time-sensitive ones. Also, prices for Spot fluctuate based on supply and demand, which can make costs less predictable. However, for workloads that can handle interruptions, the potential savings are substantial.

Azure Hybrid Benefit

Some users, often those migrating from on-premises environments, still hold subscriptions to on-premises core licenses. Azure Hybrid Benefit lets you apply qualifying Windows and Linux subscriptions to your VMs at a discounted price, since Azure can separate out and remove the licensing cost.

It’s particularly beneficial for users migrating to the cloud, since they can take advantage of dual-use rights for 180 days, using both their on-premises licenses and Azure VMs during the migration period without being double charged. For further cost savings, Azure Hybrid Benefit can be combined with Reserved VM Instances, for up to 80% savings compared to the pay-as-you-go rate.
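Azure Hybrid Benefit is applied per VM by setting its license type. As a rough sketch (assuming the azure-identity and azure-mgmt-compute packages, plus hypothetical subscription, resource group, and VM names), you could enable it on an existing Windows Server VM like this:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.compute.models import VirtualMachineUpdate

SUBSCRIPTION_ID = "<your-subscription-id>"   # assumed placeholder
RESOURCE_GROUP = "my-rg"                     # hypothetical resource group
VM_NAME = "my-windows-vm"                    # hypothetical VM name

credential = DefaultAzureCredential()
compute_client = ComputeManagementClient(credential, SUBSCRIPTION_ID)

# Setting license_type to "Windows_Server" tells Azure to bill only the base
# compute rate and apply your existing on-premises license (Hybrid Benefit).
poller = compute_client.virtual_machines.begin_update(
    RESOURCE_GROUP,
    VM_NAME,
    VirtualMachineUpdate(license_type="Windows_Server"),
)
poller.result()
```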

Visibility Tools for Azure VMs

The recommendations in the next few sections require visibility into your VM usage and performance. For analyzing workloads and identifying optimization opportunities, such as over-provisioned instance sizes and unused VMs, tools like Azure Monitor, Azure Advisor, and Vantage can help you make data-driven decisions.

Azure Monitor to Analyze VM Usage

Azure Monitor provides detailed metrics on CPU, memory, storage, and network usage for your VMs. To ensure your VMs are rightsized, you can monitor the Percentage CPU metric to see whether it is consistently low. You can also spot unused VMs by checking related signals, such as network in/out and disk read/write activity over a recent period.
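If you prefer to pull these numbers programmatically, the azure-monitor-query package can retrieve the same metrics Azure Monitor shows in the portal. Below is a minimal sketch (assuming that package plus azure-identity, and a hypothetical VM resource ID) that fetches a week of hourly average “Percentage CPU” values, the metric most relevant to rightsizing.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Hypothetical VM resource ID; substitute your own.
VM_RESOURCE_ID = (
    "/subscriptions/<sub-id>/resourceGroups/my-rg"
    "/providers/Microsoft.Compute/virtualMachines/my-vm"
)

client = MetricsQueryClient(DefaultAzureCredential())

# Hourly average CPU over the last 7 days.
response = client.query_resource(
    VM_RESOURCE_ID,
    metric_names=["Percentage CPU"],
    timespan=timedelta(days=7),
    granularity=timedelta(hours=1),
    aggregations=[MetricAggregationType.AVERAGE],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            if point.average is not None:
                print(f"{point.timestamp}: {point.average:.1f}% CPU")
```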

Azure Advisor for Cost Saving VM Recommendations

Azure Advisor analyzes CPU and network usage over time and provides personalized recommendations for optimization opportunities. Recommendations for VMs may include identifying over-provisioned VMs and unused resources. However, keep in mind that recommendations are based on a limited lookback window of recent usage, so things like seasonal fluctuations in workload may not be considered.

Vantage for VM Custom Cost Reporting

Third-party solutions like Vantage extend VM visibility with specialized reporting features, cost recommendations, automatic anomaly detection, and budget notifications. Vantage helps organizations gain deeper insights into their VM spending patterns, enabling more precise financial planning and cost optimization. Cost recommendations include right sizing, unattached virtual hard disks, Reserved VM Instances, savings plans for compute, and more.

Azure Migrate for Migrating From On-Premises

Azure Migrate is used to assess on-premises workloads before migration. When choosing a VM, it can be used to analyze your current infrastructure and recommend appropriate VM sizes based on performance requirements and usage patterns.

Selecting the Correct Azure VM Size or Right Sizing Azure VMs

Using the above tools and resources, you’ll have a better understanding of your peak CPU, memory, storage, and network needs. Also consider any specific requirements like GPU or high I/O performance. With this information in hand, compare specifications of different sizes within your chosen VM family. Look for the smallest size that meets or slightly exceeds your peak requirements.

Once you’ve selected a size, deploy your application and run load tests to verify performance under typical and peak conditions. After deployment, continuously monitor resource utilization, considering downsizing if you consistently use less than 50% of any resource, or upsizing if you regularly exceed 80%.
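As a simple sketch of that rule of thumb, the snippet below (pure Python, using the 50%/80% thresholds from this article as assumptions you should tune) flags a VM as a downsize or upsize candidate from its average and peak CPU figures, which you could feed in from the Azure Monitor query shown earlier.

```python
# Thresholds follow the rule of thumb above; tune them for your workloads.
DOWNSIZE_BELOW_PCT = 50
UPSIZE_ABOVE_PCT = 80

def rightsizing_hint(avg_cpu_pct: float, peak_cpu_pct: float) -> str:
    """Classify a VM from its observed CPU utilization."""
    if peak_cpu_pct > UPSIZE_ABOVE_PCT:
        return "consider upsizing (regularly exceeds 80% CPU)"
    if avg_cpu_pct < DOWNSIZE_BELOW_PCT and peak_cpu_pct < DOWNSIZE_BELOW_PCT:
        return "consider downsizing (consistently under 50% CPU)"
    return "size looks appropriate"

# Example: figures pulled from a week of Azure Monitor data.
print(rightsizing_hint(avg_cpu_pct=22.0, peak_cpu_pct=41.0))
print(rightsizing_hint(avg_cpu_pct=65.0, peak_cpu_pct=93.0))
```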

Autoscale with VM Scale Sets

For workloads with fluctuating demand, autoscaling with VM scale sets automatically adjusts the number of instances based on a set schedule, defined metrics, or usage patterns (manual scaling remains available when you need it). A common metric for scaling is CPU usage, for example, adding instances when average CPU usage exceeds 65%. Schedule-based autoscaling is useful for predictable patterns, such as reducing instances during weekends when user activity is lower.
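As a rough sketch of such a metric-based rule (assuming the azure-identity and azure-mgmt-monitor packages, hypothetical resource names, and model classes as exposed in recent SDK versions, so verify against your installed version), the following adds one instance when average CPU over 10 minutes exceeds 65%:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    AutoscaleSettingResource, AutoscaleProfile, ScaleCapacity,
    ScaleRule, MetricTrigger, ScaleAction,
)

SUBSCRIPTION_ID = "<your-subscription-id>"     # assumed placeholder
RESOURCE_GROUP = "my-rg"                       # hypothetical resource group
SCALE_SET_ID = (
    "/subscriptions/<sub-id>/resourceGroups/my-rg"
    "/providers/Microsoft.Compute/virtualMachineScaleSets/my-vmss"
)

monitor_client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Scale out by one instance when average CPU over 10 minutes exceeds 65%.
scale_out = ScaleRule(
    metric_trigger=MetricTrigger(
        metric_name="Percentage CPU",
        metric_resource_uri=SCALE_SET_ID,
        time_grain="PT1M",
        statistic="Average",
        time_window="PT10M",
        time_aggregation="Average",
        operator="GreaterThan",
        threshold=65,
    ),
    scale_action=ScaleAction(
        direction="Increase", type="ChangeCount", value="1", cooldown="PT5M"
    ),
)

profile = AutoscaleProfile(
    name="cpu-based",
    capacity=ScaleCapacity(minimum="2", maximum="10", default="2"),
    rules=[scale_out],
)

monitor_client.autoscale_settings.create_or_update(
    RESOURCE_GROUP,
    "my-vmss-autoscale",                       # hypothetical setting name
    AutoscaleSettingResource(
        location="eastus",
        profiles=[profile],
        target_resource_uri=SCALE_SET_ID,
        enabled=True,
    ),
)
```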

Automatically Deallocate Azure VMs

VMs can be in two non-running states: Stopped, where the VM is shut down through the operating system but still incurs compute charges, and Stopped (deallocated), where the VM is stopped through Azure, halting compute charges. Automatically deallocating VMs when not in use can significantly reduce costs, especially for non-production workloads or VMs with predictable usage patterns.
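You can check which of these states a VM is in from its instance view. Here is a minimal sketch (assuming azure-identity and azure-mgmt-compute, plus hypothetical subscription and resource group names) that prints each VM’s power state, so stopped-but-not-deallocated VMs that are still accruing compute charges stand out:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<your-subscription-id>"   # assumed placeholder
RESOURCE_GROUP = "my-rg"                     # hypothetical resource group

credential = DefaultAzureCredential()
compute_client = ComputeManagementClient(credential, SUBSCRIPTION_ID)

for vm in compute_client.virtual_machines.list(RESOURCE_GROUP):
    view = compute_client.virtual_machines.instance_view(RESOURCE_GROUP, vm.name)
    # Power state appears as a status code such as "PowerState/running",
    # "PowerState/stopped" (still billed for compute), or
    # "PowerState/deallocated" (compute billing halted).
    power_states = [
        s.code for s in view.statuses
        if s.code and s.code.startswith("PowerState/")
    ]
    print(vm.name, power_states)
```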

Azure offers several methods for automatic deallocation. The built-in Auto-Shutdown feature allows you to schedule a daily shutdown time for each VM individually. However, it requires manually restarting the machines and configuring each VM one at a time, which is not ideal when you are managing several VMs.

For more flexibility, the Start/Stop VMs v2 solution, part of Azure Automation, enables both scheduled start and stop operations, as well as more complex automation scenarios. This solution can schedule VM starts and stops on a recurring basis, shut down VMs based on low CPU usage, and start VMs on-demand through webhooks.
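If you roll your own automation instead (for example, a script run on a schedule by Azure Automation or a cron job), deallocation itself is a single SDK call. Below is a minimal sketch assuming azure-identity and azure-mgmt-compute, a hypothetical resource group, and a hypothetical "auto-deallocate" tag used to opt VMs in:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<your-subscription-id>"   # assumed placeholder
RESOURCE_GROUP = "my-rg"                     # hypothetical resource group

credential = DefaultAzureCredential()
compute_client = ComputeManagementClient(credential, SUBSCRIPTION_ID)

# Deallocate (not just stop) every VM opted in via the hypothetical tag,
# so compute charges actually halt.
for vm in compute_client.virtual_machines.list(RESOURCE_GROUP):
    if (vm.tags or {}).get("auto-deallocate") == "true":
        print(f"Deallocating {vm.name}...")
        compute_client.virtual_machines.begin_deallocate(
            RESOURCE_GROUP, vm.name
        ).result()
```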

Delete Unused VMs and Associated Resources

Unused VMs will continue to incur charges, even when stopped, and even when deallocated (e.g., you are still charged a small amount for disk space). The first step is identifying VMs that are no longer needed using Azure Monitor and other resources.

Once you have identified a candidate, confirm it is truly no longer needed before deleting it: review any internal documentation and verify with the owning teams. As an extra precaution, consider deallocating the VM for a period to preserve its configuration and make sure no issues arise. When you’re ready to delete the VM, back up any necessary data, then delete it through the Azure portal, Azure CLI, or Azure PowerShell. Remember to delete its associated resources as well, including disks, network interfaces, public IP addresses, and load balancers.
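A rough sketch of that cleanup step with the Python SDK (assuming azure-identity, azure-mgmt-compute, and azure-mgmt-network, and hypothetical names): it deletes the VM, then its OS disk and network interfaces, which Azure may leave behind depending on how the VM was created. Data disks and public IPs would need the same treatment.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient

SUBSCRIPTION_ID = "<your-subscription-id>"   # assumed placeholder
RESOURCE_GROUP = "my-rg"                     # hypothetical resource group
VM_NAME = "old-unused-vm"                    # hypothetical VM name

credential = DefaultAzureCredential()
compute_client = ComputeManagementClient(credential, SUBSCRIPTION_ID)
network_client = NetworkManagementClient(credential, SUBSCRIPTION_ID)

# Capture the disk and NIC references before the VM object disappears.
vm = compute_client.virtual_machines.get(RESOURCE_GROUP, VM_NAME)
os_disk_name = vm.storage_profile.os_disk.name
nic_ids = [nic.id for nic in vm.network_profile.network_interfaces]

# Delete the VM itself, then the resources it leaves behind.
compute_client.virtual_machines.begin_delete(RESOURCE_GROUP, VM_NAME).result()
compute_client.disks.begin_delete(RESOURCE_GROUP, os_disk_name).result()
for nic_id in nic_ids:
    nic_name = nic_id.split("/")[-1]
    network_client.network_interfaces.begin_delete(RESOURCE_GROUP, nic_name).result()
```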

Conclusion

Optimizing VM costs requires a multi-faceted approach. By selecting the right VM types and sizes, utilizing pricing options like Reserved VM Instances and savings plans for compute, and implementing automated scaling and deallocation strategies, users can significantly reduce their VM spend.