In this episode, I take a look at how to measure the availability of our systems.
Much of this episode is inspired by the Site Reliability Engineering practices that come out of Google.
Why you might be interested in this episode
Published: Wed, 30 Mar 2022 15:47:27 GMT
Hello, and welcome back to the Better ROI from Software Development podcast.
In this episode, I want to take a look at how we measure the availability of our systems. Much of this episode is inspired by the Site Reliability Engineering practises that come out of Google.
So why might you be interested in this episode?
Let's start with a bit of a recap, in the last episode, I introduced the set of practises and principles that Google employs to run at the scale it does: Site Reliability Engineering (or SRE). Wikipedia describes Site Reliability Engineering as:
"Site reliability engineering is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps"
Site Reliability Engineering originated out of Google and was a direct response to handling massive scale.
Within the SRE practise is guidance on how to measure the availability of its services. It does this by looking at three things: Service Level Indicators, Service Level Objectives and Service Level Agreements.
I'll talk about those in more detail later in this episode.
But first, let's talk about availability. In one of my prior roles I inherited an internal Service Level Agreement (SLA) from my predecessor. Our main website had an SLA of 99.9% uptime, which amounts to no more than 8.76 hours of downtime per year. We used an external system to check every five minutes whether the site was available. Then, once a month, I diligently checked the figures and reported them back to the board. If the SLA fell beneath 99.9%, I'd expect a stern conversation, after which I'd invariably have the same conversation with the hosting provider, which may have resulted in a small rebate on the monthly bill.
And then we carried on, as we've done previously.
Now, in hindsight, there are a number of problems with this.
We were only checking whether the website was available - the uptime. We weren't checking that the website was usable by the customer - did a customer just get a loading spinner?
Or that the website allowed the customer to create an order - or did it just error?
Or that the order was completed successfully in a timely manner - did it take 30 minutes to complete an order when it should have been seconds?
And we were only checking periodically, every five minutes - with this type of sample testing, we had no idea if the website was down for four minutes between two checks.
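To illustrate that sampling gap, here's a quick sketch (the probe interval matches the example; the outage window is hypothetical) showing how five-minute probes can miss a four-minute outage entirely:

```python
# Probes every 5 minutes can miss any outage shorter than the gap between probes.
PROBE_INTERVAL_MIN = 5
probe_times = range(0, 60, PROBE_INTERVAL_MIN)  # probe minutes past the hour

# Hypothetical outage: minutes 11 to 14 inclusive (4 minutes of downtime).
outage_minutes = set(range(11, 15))

# Which probes actually observed the outage?
probes_that_saw_downtime = [t for t in probe_times if t in outage_minutes]
print(probes_that_saw_downtime)  # [] - the outage falls entirely between probes
```

Every probe reports "up", so the monthly report would show 100% availability despite four minutes of real downtime.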
In hindsight, none of this was that useful.
Now, if we look at this through the Site Reliability Engineering lens, first we'd want to establish which Indicators correlate to our business objectives. In the case of this example, that would very likely be linked to the customer experience. So maybe we'd want an Indicator comparing the number of successful orders against all orders made.
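As a sketch of that kind of Indicator - the function name and the figures are my own, purely for illustration:

```python
def order_success_sli(successful_orders: int, total_orders: int) -> float:
    """Return the order success ratio as a percentage (the SLI value)."""
    if total_orders == 0:
        return 100.0  # no traffic: conventionally treated as meeting the objective
    return 100.0 * successful_orders / total_orders

# e.g. 9,985 successful orders out of 10,000 attempts
print(order_success_sli(9_985, 10_000))  # 99.85
```

The key point is that the metric is defined from the customer's point of view (did the order succeed?), not from the infrastructure's (was the server up?).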
We would then want to establish our Objective for that Indicator - what percentage of successful orders should we have for a given period?
Now, at this point, you may well be thinking, well, it should be 100% - we don't want to lose any orders. However, achieving 100% can be exceptionally expensive, and in the case of many metrics, prohibitively so.
The more resilience that you want from a service, the more expensive it will be to operate.
For most of its services, Google doesn't attempt to reach 100% - they know the customer is unlikely to receive 100% reliability getting into their service in the first place - be it network problems or faulty equipment - so why overspend when the customer is realistically not expecting 100%?
Rather, your starting point should be looking at the historical performance against that Indicator. That historical performance helps to provide a baseline. And from this, you can establish an "appropriate" Service Level Objective.
Note that I say "appropriate"; if the Service Level Objective is a stretch goal, and resources not appropriately provided to improve the service to that level, then you will fail repeatedly and no one will have faith in the service or the team.
Interestingly, Google has two versions of a Service Level Objective: an internal one and an external one. The external is shared with its customers and will form part of the Service Level Agreement - so, for example, 99.9%.
Whereas the internal SLO will be more stringent, say, for example, 99.95% - allowing the team a buffer between expected, and where they would invoke any Service Level Agreement actions.
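A minimal sketch of that internal/external split, using the 99.9% and 99.95% figures from the example (the thresholds, function name and messages are purely illustrative):

```python
# Hypothetical thresholds, matching the figures in the episode.
EXTERNAL_SLO = 99.9   # shared with customers; backs the SLA
INTERNAL_SLO = 99.95  # stricter target the team actually works to

def slo_status(measured_availability: float) -> str:
    """Classify a measured availability against the internal and external SLOs."""
    if measured_availability >= INTERNAL_SLO:
        return "healthy"
    if measured_availability >= EXTERNAL_SLO:
        return "in buffer - internal SLO missed, SLA still intact"
    return "breached - SLA consequences apply"

print(slo_status(99.97))  # healthy
print(slo_status(99.92))  # in buffer - internal SLO missed, SLA still intact
print(slo_status(99.80))  # breached - SLA consequences apply
```

The gap between the two thresholds is the buffer: the team gets early warning and time to react before any contractual SLA penalty is triggered.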
As an aside, when setting a Service Level Objective, it should be made clear to the wider organisation what that means and the potential impact. An SLO of 99.9% still allows 8.76 hours of downtime per year. If that entire 8.76 hours occurs on your busiest day, then while still technically within the SLO for the year, it is likely to have a significant impact on the organisation.
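The arithmetic behind those figures is straightforward; here's a small sketch (the function name is my own):

```python
def downtime_budget_hours(slo_percent: float, period_hours: float = 365 * 24) -> float:
    """Hours of allowed downtime in a period for a given availability objective."""
    return period_hours * (100.0 - slo_percent) / 100.0

print(round(downtime_budget_hours(99.9), 2))   # 8.76 hours per year
print(round(downtime_budget_hours(99.95), 2))  # 4.38 hours per year
```

Note how each extra "nine" (or half-nine) halves or shrinks the budget dramatically - which is exactly why each additional nine gets so much more expensive to deliver.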
But we do need to remember that failure can and will happen. There's an SRE saying: "hope is not a strategy". Failure is a fact of life, so Service Level Objectives need to be discussed and understood to manage expectations.
Having a meaningful SLO allows you and your organisation to choose between making the site more reliable - which would potentially increase costs and slow development - or making the site less reliable - which allows greater velocity of development.
That's a choice you and your organisation need to balance - and the SLO helps you to frame your options.
So let's tie this all back to the Site Reliability Engineering principles:
The SLI, the Service Level Indicator, is the metric to be observed - in our earlier example, that was the successful orders.
The SLO, the Service Level Objective, is the required level of that Service Level Indicator, or a collection of Service Level Indicators - so in our example, we wanted our SLO to be 99.9% of successful orders.
And the SLA, the Service Level Agreement, is what happens if that SLO is breached. For internal systems, there is unlikely to be a penalty, whereas for external systems, the SLA is most likely to stipulate a financial penalty if the SLO is breached.
In this episode, I've talked about:
If you're interested in learning more about this subject, I'll include two links in the show notes to articles written by Google about this very subject.
In the next episode, I want to look at a concept that builds on this: Error Budgets.
Thank you for taking the time to listen to this podcast. I look forward to speaking to you again next week.