Detective work is an important part of FinOps. When costs unexpectedly rise or fall, you need to figure out why. Typically, this means exploring different Cost Reports and talking with the engineers who are responsible for incurring these costs.
Cloud cost visibility tools, like Vantage, are an important part of this process, but you want to make sure the cost data you’re exploring is set up in a way that simplifies the detective work process, rather than complicating it. This is where it’s important to understand the concepts of cost versus usage when building your Cost Reports.
A Detective Work Scenario
As an example, let’s imagine a company with a solid FinOps practice. All their infrastructure is properly tagged with just two FinOps-related tags: team, to identify the engineering team responsible for these costs, and service, denoting the internal application running on this infrastructure.
Each week, engineers look at a Cost Report that shows their respective costs. This Cost Report is filtered to their specific team, so it only shows their incurred costs. These costs are grouped by service, and engineers can see at-a-glance week-over-week costs for immediate cost feedback, and month-over-month costs for the larger context.
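As a rough illustration of the report described above, the following sketch filters line items to one team, groups them by service, and computes the week-over-week change. The line items, tag values, and function names are all hypothetical and are not taken from any particular tool's API.

```python
from collections import defaultdict

# Hypothetical tagged line items: (week, team tag, service tag, cost in dollars).
line_items = [
    ("2024-W01", "payments", "checkout-api", 500.0),
    ("2024-W01", "payments", "ledger", 300.0),
    ("2024-W01", "search", "indexer", 900.0),
    ("2024-W02", "payments", "checkout-api", 600.0),
    ("2024-W02", "payments", "ledger", 300.0),
    ("2024-W02", "search", "indexer", 950.0),
]

def weekly_costs_by_service(items, team):
    """Keep only one team's items and total them per service and week."""
    totals = defaultdict(lambda: defaultdict(float))
    for week, item_team, service, cost in items:
        if item_team == team:
            totals[service][week] += cost
    return totals

def week_over_week_change(totals, previous, current):
    """Fractional change per service; services with no prior cost are skipped."""
    return {
        service: (weeks[current] - weeks[previous]) / weeks[previous]
        for service, weeks in totals.items()
        if weeks.get(previous)
    }

totals = weekly_costs_by_service(line_items, team="payments")
changes = week_over_week_change(totals, "2024-W01", "2024-W02")
for service, change in sorted(changes.items()):
    print(f"{service}: {change:+.0%}")
```

A report like this tells the team which service moved, but not whether the movement came from usage, rate, or discount.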
This week’s report shows a 20% increase in cost for one of the team’s applications. This spike is unexpected—nothing materially changed with the product, and no new features were launched. This kicks off some detective work, which means digging into the data to figure out why these costs changed.
But the investigation is proving more difficult than expected. Since there was no new launch, that can be ruled out as a reason. To double-check, the team looks at the total number of machines in its fleet, which is unchanged before and after the spike.
Poring over logs and usage data doesn’t bubble up any anomalies either. There are no product usage increases that match the cost increase.
Instead, there’s a cost increase that doesn’t seem to tie at all to any usage increase.
An Unlikely Solution
So, what was causing the cost spike? All the Cost Report dashboards were set up by a centralized FinOps team. When building these reports, special care was taken to make sure all costs were completely accurate: all numbers in the Cost Reports match the costs billed in the monthly invoice.
This is incredibly useful for an accounting department, but it can cause engineering teams to chase their tail when doing detective work.
Unbeknownst to the team investigating their cost increase, a different team launched a new product. Normally, this shouldn’t cause any issues for the first team, as everything is tagged correctly, so new costs are properly reported on each respective Cost Report.
In this case, however, something happened behind the scenes with the company’s Savings Plan coverage. Savings Plans are a financial instrument that adds savings to compute instances, but Amazon applies those savings to instances automatically, rebalancing regularly to ensure the deepest discount.
When Team 2 launched its new product, it used an instance type with large Savings Plan discounts. As a result, the Savings Plans that were previously being applied to Team 1’s infrastructure automatically moved to Team 2.
This loss of Savings Plan coverage on Team 1’s infrastructure caused a cost increase completely unrelated to any action either team took.
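The effect can be sketched with a toy model. This is a deliberately simplified illustration, not Amazon's actual allocation logic: it assumes a fixed pool of discounted hours that is handed out to the usage with the deepest discount first. The hours, rates, discounts, and names below are all made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Usage:
    team: str
    hours: float
    on_demand_rate: float  # dollars per hour
    discount: float        # fraction saved on each covered hour

def billed_costs(usages, covered_hours):
    """Toy model: allocate a fixed pool of covered hours, deepest discount first."""
    costs = {}
    remaining = covered_hours
    for u in sorted(usages, key=lambda u: u.discount, reverse=True):
        covered = min(u.hours, remaining)
        remaining -= covered
        discounted = covered * u.on_demand_rate * (1 - u.discount)
        full_price = (u.hours - covered) * u.on_demand_rate
        costs[u.team] = costs.get(u.team, 0.0) + discounted + full_price
    return costs

# Team 1's usage is identical before and after Team 2's launch.
team1 = Usage("team-1", hours=1000, on_demand_rate=0.10, discount=0.20)
team2 = Usage("team-2", hours=800, on_demand_rate=0.10, discount=0.40)

before = billed_costs([team1], covered_hours=1000)
after = billed_costs([team1, team2], covered_hours=1000)

increase = after["team-1"] / before["team-1"] - 1
print(f"Team 1 before: ${before['team-1']:.2f}")
print(f"Team 1 after:  ${after['team-1']:.2f} ({increase:+.0%})")
```

With these made-up numbers, Team 1's billed cost rises by about 20% even though its hours and on-demand rate never change. Only the share of the discount pool it receives is different.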
Cost vs Usage
Oftentimes, when setting up a new cost visibility program, too much emphasis is placed on making sure that the costs in each Cost Report match the totals in the monthly invoice.
But it’s important to remember that different teams need different types of cost visibility. Finance and accounting often want Cost Reports that perfectly reflect the incoming bills. But for other teams, such as engineering, this can do more harm than good.
When building Cost Reports for engineers, it’s more important to only show cost changes that occur due to actions taken by engineers, and filter out any cost fluctuations that are outside their control.
This often means creating Cost Reports that filter out all discounts. In this sans-discount Cost Report, engineers will see totals that are more expensive than the actual bill, but the cost fluctuations that remain will be meaningful and actionable, and still proportional to the total costs incurred by their team.
At its core, cost is usage multiplied by rate and discount. When designing Cost Reports for different teams, make sure you’re only showing the parts of this formula that are meaningful to those consuming the reports.
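One way to picture this is to compute the same weeks twice, once with the discount term and once without. The sketch below assumes the discount acts as a fractional reduction, so cost = usage × rate × (1 − discount). That reading, along with the weekly figures and variable names, is an assumption made for illustration.

```python
def cost(usage_hours, rate, discount=0.0):
    """Cost as usage times rate, reduced by a fractional discount (assumed form)."""
    return usage_hours * rate * (1.0 - discount)

# Hypothetical weeks: usage and rate are flat, only the effective discount moves.
weeks = [
    {"week": "week 1", "usage_hours": 1000, "rate": 0.10, "discount": 0.20},
    {"week": "week 2", "usage_hours": 1000, "rate": 0.10, "discount": 0.04},
]

# Finance view: includes discounts, so it tracks the invoice.
finance_view = [cost(w["usage_hours"], w["rate"], w["discount"]) for w in weeks]

# Engineering view: discounts filtered out, so only usage and rate changes show up.
engineering_view = [cost(w["usage_hours"], w["rate"]) for w in weeks]

for w, fin, eng in zip(weeks, finance_view, engineering_view):
    print(f"{w['week']}: finance ${fin:.2f}, engineering ${eng:.2f}")
```

In this example the finance view jumps between the two weeks while the engineering view stays flat, which matches what the engineers actually did: nothing.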