#84: Service Level Agreements - an episode 83 follow up

In episode 83, I had a long chat with Trevor Ewen about how he provides software service to non-technical clients.

If you've not listened to it, it was an excellent interview full of wonderful insights.

So much so that I want to pull a number of topics from that podcast and look at them in more depth.

In this episode, I want to look at the SLA.

Published: Wed, 12 May 2021 15:46:40 GMT

Transcript

Hello and welcome back to The Better ROI from Software Development podcast.

In episode 83, I had a long chat with Trevor Ewen about how he provides software services to non-technical clients. If you've not listened to it, it was an excellent interview, full of wonderful insights.

So much so I wanted to pull a number of topics from that podcast and look at them in more depth.

In this episode, I want to look at the SLA.

Firstly, what is an SLA? SLA stands for Service Level Agreement, and Wikipedia describes it as:

"A service-level agreement (SLA) is a commitment between a service provider and a client. Particular aspects of the service – quality, availability, responsibilities – are agreed between the service provider and the service user."

In the interview, Trevor talked about how he was using SLAs to make it clear to his clients exactly what he and his organisation were providing. And possibly even more valuably, what they weren't.

So why have an SLA?

First and foremost, it helps set expectations. It makes both parties clear as to what is happening, and this really does help to avoid friction later down the line.

In the interview, Trevor gave the example of clients potentially expecting his staff to work through the middle of the night and at weekends to fix trivial issues.

And as a client, potentially, you have that expectation.

Potentially you have the expectation that you're going to get a gold standard: people on standby and available at all times for anything.

But realistically, that doesn't happen.

And the same is very true internally, not just in a situation where it's a third party providing you services. Even if you have an internal team, it's good to understand and make explicit the expectations you have for that team.

I've seen organisations that expect their software to be maintained, available and reliable to a gold standard. They believe that at any point, day or night, if something goes wrong with that software, someone is on call: someone capable of jumping on, fixing it and resolving it, so that the business suffers as close to zero impact as possible.

However, in reality, there had been a considerable breakdown in communication. There simply was not the capability, the resources or even the quantity of staff within the team to provide that level of service. The truth was closer to best endeavours than to any form of reliable service level agreement.

So when the inevitable happens and something goes wrong, the business are at a loss to understand why, half an hour, an hour, two hours, four hours into the problem, somebody isn't already there fixing and resolving it.

They have entered into that issue, that situation, believing in one vision of reality. They expect something to happen because this is how they expect it to be. They have their expectations. Unfortunately, those expectations did not match reality. And as such, they were sorely disappointed when the problem took considerably longer to solve and had considerably more impact on the business.

And this is what a service level agreement, even if it's an informal one, helps us to understand. It helps us to avoid those unpleasant surprises when things go wrong.

By its very nature, a Service Level Agreement is two parties agreeing to a certain level of quality, availability and responsibility. They're agreeing to a certain way the service will work.

It is probably important at this point to highlight the word agreement: both parties must be able to agree on whatever is defined within that level of quality, availability and responsibility.

You cannot force an SLA onto another party from either direction, otherwise it simply will not be committed to. It has to be something that the parties can both agree to.

So going back to that previous example, if the organisation had insisted that their staff provide a 24 hour, seven day a week emergency response, they could put that into an SLA. They could even impose it on the team. But realistically, unless the team are firstly physically able to provide that service, and secondly willing to buy in and commit to it, the SLA is largely useless.

It has to be something that both parties can agree to and commit to.

And often simply going through that conversation, trying to arrange that agreement and that commitment, is an eye opener. Both parties start to understand the expectations and the limitations. And between them, they get a much closer understanding of what is possible. And of course, once they know what is possible, and if it falls short of the standard the organisation wants, then they know they have work to do. They know they need to get the organisation as a whole into a position where the SLA can be agreed to, or at least reach a level where everybody is comfortable with that commitment.

As a method of exposing mismatches between expectations, a method of highlighting surprises before they actually happen, I think SLAs are very similar to Gamedays.

I introduced Gamedays in Episode 47 as part of looking at how to proactively look at what happens when the system goes wrong. The Gameday is a bit like a role playing exercise where you look at a system and ask "what if?"

"What if this stopped working?"

"What if this broke?"

"What if this server was turned off?"

"What if this network connection was broken?"

And it's asking those "what ifs" of the team to understand what would happen.

Is the system built in such a way that it could automatically fail over? Say, for example, if one server went down, have you got a second server there as a redundant server so the service will continue operating without effect?

If so, how would anyone know that first server went down?

How would they know to go fix it?

How would they know before the second server went down as well?
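Those three questions can be made concrete with a small sketch. This is purely illustrative: the `Server` class and the `check_pool` helper are hypothetical names of my own, not part of any real monitoring tool. The idea is simply that a redundant pool should raise a warning as soon as one server is lost, even though the service itself is still running.

```python
from dataclasses import dataclass


@dataclass
class Server:
    # Hypothetical stand-in for a real server and its health probe.
    name: str
    healthy: bool = True


def check_pool(servers: list[Server]) -> list[str]:
    """Return alert messages for a redundant pool of servers."""
    down = [s.name for s in servers if not s.healthy]
    up = len(servers) - len(down)
    alerts = []
    if up == 0:
        alerts.append("CRITICAL: no healthy servers, service is down")
    elif down:
        # The service still works, but someone has to know that
        # the redundancy is gone before the next server fails.
        alerts.append(
            f"WARNING: {', '.join(down)} down, {up} healthy server(s) left"
        )
    return alerts


pool = [Server("primary"), Server("secondary")]
pool[0].healthy = False  # Gameday question: "What if this server was turned off?"
for message in check_pool(pool):
    print(message)
```

In a real system the alert would need to reach a person who is expected to act on it, which is exactly the expectation an SLA makes explicit.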

It's a way of going through not just the technical aspects of any software solution or platform, to understand where the gaps, the holes, the issues might arise, but also the training and preparation of the team. It means thinking about "OK, are we in a position to support this at 3:00 in the morning?", "Are we in a position to support this if our primary data centre goes down?"

Again, by going through this exercise, you're looking for where unpleasant surprises could occur when things go wrong. Much better to do this in a fabricated situation, much better to do this when you're role playing in a boardroom, than to find out at your busiest time of the year, right in the middle of the night.

This approach is exemplified by Netflix.

Netflix brought to the world the concept of Chaos Engineering. Netflix actively turns services on and off throughout the day. They may turn a network server off. They may take a data centre off. They may introduce various faults within their system, and they do that in production during the working day.

Why? Because it allows them to be confident that when a fault occurs - and it will without them introducing it - the system will cope.

By proactively introducing faults, they can monitor and validate the effects and, if a fault is having a negative effect, reverse it immediately. They can then make the appropriate changes so that, should that event happen again, the system can cope.
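As a rough illustration of that inject, validate and reverse loop, here is a minimal sketch. It is not Netflix's tooling: `run_experiment` and the toy `running` dictionary are hypothetical, standing in for a real fault injector and a real health check.

```python
from typing import Callable


def run_experiment(
    inject: Callable[[], None],
    revert: Callable[[], None],
    is_healthy: Callable[[], bool],
) -> bool:
    """Inject a fault, check the system, and always revert the fault.

    Returns True if the system coped while the fault was active.
    """
    inject()
    try:
        return is_healthy()
    finally:
        revert()  # reverse the fault whether or not the check passed


# Toy system: two servers, healthy while at least one is running.
running = {"server-a": True, "server-b": True}

coped = run_experiment(
    inject=lambda: running.update({"server-a": False}),
    revert=lambda: running.update({"server-a": True}),
    is_healthy=lambda: any(running.values()),
)
print("system coped:", coped)
```

The important design point is the `finally`: the fault is reversed whatever the outcome, so a failed experiment produces a finding to fix rather than a prolonged outage.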

Think about it as being an immunisation shot against a disease. Think about it as being a vaccination against Covid-19. By introducing small failures into the system in a controlled manner, the platform, through the team's work, evolves and effectively develops antibodies to cope with those problems. And over time, the system just gets stronger and stronger and much more resilient to most failures.

Now, it's obviously a very advanced level that Netflix have reached. They've spent a lot of time building from the ground up.

They would have started with very dry "what if" scenarios in a boardroom, without going anywhere near the production system originally. And slowly over time, they've introduced more and more pieces to expand that Chaos Engineering mindset and work.

So both Service Level Agreements and Gamedays help us to improve our communication and our relationship between parties within the organisation. Or indeed, if you're like Trevor, a third party organisation.

Better communication produces better working relationships.

It helps us remove surprises from our working day. It also helps us to work together better to handle any surprises we hadn't thought about as they do occur.

Ultimately, it helps us to build better trust and better outcomes.

Think about how much a surprise like this could affect your organisation. Think about a surprise where you believe that the team is capable, prepared and committed to solving problems - maybe in the middle of the night, maybe to that gold standard - but in reality they aren't.

You're sitting there under the mistaken belief that your business is protected, insured, inoculated, but in truth it isn't.

Without having those real conversations and ultimately testing it - so between that conversation of having the SLA to understand the commitment and then things like Gamedays to prove it - how sure can you be that your business will survive the next surprise?

Thank you for taking the time to listen to this episode. I look forward to speaking to you again next week.