#130: To Checklist or not to Checklist

In this episode, I want to take a look at Checklists - when to use them and when not to.

Much of this episode is inspired by the Site Reliability Engineering practices that come out of Google.

Why you might be interested in this episode

  • The value of a checklist
  • Situations where it is appropriate
  • And situations where it isn't

CORRECTION: During the episode I refer to "The Manifest Checklist" - this should have been "The Checklist Manifesto: How To Get Things Right" by Atul Gawande

Published: Wed, 27 Apr 2022 16:10:57 GMT

Transcript

Hello and welcome back to the Better ROI from Software Development podcast.

In this episode I want to take a look at Checklists - when to use them and when not to.

Much of this episode is inspired by the Site Reliability Engineering practices that come out of Google.

Why might you be interested in this episode?

Firstly, understanding the value of checklists.

And then looking at situations where they're appropriate.

And situations where they aren't.

But let's start with a recap; I've previously introduced the set of practices and principles that Google employs to run at the scale it does: Site Reliability Engineering (or SRE). Google describes site reliability engineering as:

"Site reliability engineering is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps"

Site Reliability Engineering originated at Google and was a direct response to their having to handle massive scale.

When going through the canonical book on the subject, Site Reliability Engineering, published by O'Reilly, a few things came to mind regarding Checklists that I thought I would share.

Checklists are a proven method of ensuring certain steps are followed.

For example, Checklists have a long history in aviation. Before each flight, a pilot will need to go through a pre-flight checklist. This has been proven to drastically reduce mistakes - and in aviation, ultimately save lives. The same has also been seen in similar high-risk environments, such as readiness checks for surgery.

The book "The Checklist Manifesto: How to Get Things Right" by Atul Gawande talks about how the use of Checklists can help avoid common failures. From the book's summary:

"Today we find ourselves in possession of stupendous know-how, which we willingly place in the hands of the most highly skilled people. But avoidable failures are common, and the reason is simple: the volume and complexity of our knowledge has exceeded our ability to consistently deliver it - correctly, safely or efficiently.

In this groundbreaking book, Atul Gawande makes a compelling argument for the checklist, which he believes to be the most promising method available in surmounting failure. Whether you're following a recipe, investing millions of dollars in a company or building a skyscraper, the checklist is an essential tool in virtually every area of our lives, and Gawande explains how breaking down complex, high pressure tasks into small steps can radically improve everything from airline safety to heart surgery survival rates."

The SRE teams at Google use Checklists for readiness checks.

For example, they use a "Launch Coordination Checklist" whenever a new product is being launched. The Checklist has been built up from many years of experience of the actions that need to be taken ahead of a launch to maximise the chance of a successful start.

The Checklist will ensure that:

  • The architecture of the solution and key use cases are documented
  • That the servers, network, etc. are all set up with an appropriate level of resilience and redundancy
  • That any new domain names have been registered and correctly configured
  • That estimates of volume and capacity have been calculated and catered for, including any anticipated launch spike
  • That data backup and restore has been tested
  • That disaster recovery plans are in place for common known failures and have been tested
  • That monitoring of the solution is in place and relevant alerts are configured
  • That security audits and reviews have been carried out
  • That relevant operational procedures are in place for running the solution
  • And that any upstream or downstream teams are aware and ready for the launch.

I'll provide a link in the show notes to a sample Google Launch Coordination Checklist.
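To make the idea of a readiness check concrete, here is a minimal sketch in Python of a checklist that blocks a launch until every item is ticked off. This is not Google's actual checklist or tooling; the names ChecklistItem, outstanding and ready_to_launch are hypothetical, and the items are paraphrased from the list above purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """One line on the checklist (hypothetical structure, for illustration)."""
    description: str
    done: bool = False


def outstanding(items):
    """Return the descriptions of the items that have not been ticked off."""
    return [item.description for item in items if not item.done]


def ready_to_launch(items):
    """A launch is ready only when no item is outstanding."""
    return not outstanding(items)


# Hypothetical items, paraphrased from the list above.
launch_checklist = [
    ChecklistItem("Architecture and key use cases documented", done=True),
    ChecklistItem("Data backup and restore tested", done=True),
    ChecklistItem("Monitoring in place and alerts configured"),
]
```

With the sample items above, ready_to_launch(launch_checklist) stays False until the monitoring item is marked as done, and outstanding(launch_checklist) tells you what is still left to do.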

So at this point, like me, you're probably thinking Checklists are great - we should be using them everywhere. But it was a further section in the Site Reliability Engineering book that made me rightly reconsider this.

When the book moved on to problem investigation, a key part of the Site Reliability Engineer's role, they explicitly recommended not using checklists. Rather, they recommend that any Site Reliability Engineer be coached and skilled in problem-solving skills like reverse engineering practices and the ability to improvise.

They found that this led to a much greater problem-solving capability than simply providing a checklist to work through. They found that an engineer working from a checklist was likely to hit a dead end, to run out of ideas, if the procedure in the checklist didn't resolve the problem. Whereas being skilled in problem solving allowed them to adapt on the fly: to draw correlations, form hypotheses and validate them.

Ultimately, the engineer with greater problem-solving skills was able to solve more problems than an engineer running with a checklist.

At the time, I remember thinking, well that's obvious - of course that was likely to be the case. But then I took some time to think about it. It may not be that obvious, and to be honest, that is the reason why I'm recording this episode.

I'm reminded of times where I've put new products into production and been asked to provide troubleshooting guides for them.

I would sit down and draw up a list of possible "what-if" scenarios, possible failure reasons, and then document the resolution for each of those scenarios. While there was certainly value in doing that work, it would have limited the engineer in what steps they could possibly take. If I hadn't covered the specific scenario, or indeed my resolution was incorrect, then any engineer would largely be out of options and need to escalate to me at 3am.

As an aside, I'd also suggest that in the above situation, if there was an expected failure situation and a known resolution, then it should really have been automated - either to remove the failure situation in the first place or to detect the failure and apply the resolution.

Why wait for the problem to have become large enough to have needed engineering investigation? The system should have been self-healing.
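As a rough illustration of what "detect the failure and apply the resolution" could look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the status dictionary, the disk-full scenario, the 95% threshold and the function names are all hypothetical, and the "resolution" only updates an in-memory value rather than touching a real system.

```python
def disk_is_full(status):
    """Detect a known failure: disk usage at or above a hypothetical 95% threshold."""
    return status.get("disk_used_percent", 0) >= 95


def clear_temp_files(status):
    """Stand-in for the known resolution; a real one would act on the system."""
    status["disk_used_percent"] = 40
    return "cleared temporary files"


# Each known "what-if" scenario is a (detector, resolution) pair.
KNOWN_FAILURES = [
    (disk_is_full, clear_temp_files),
]


def self_heal(status):
    """Apply the known resolution for every known failure that is detected.

    Returns the list of actions taken, which is empty when nothing was wrong.
    """
    actions = []
    for detect, resolve in KNOWN_FAILURES:
        if detect(status):
            actions.append(resolve(status))
    return actions
```

Run on a schedule or in response to monitoring alerts, something like self_heal({"disk_used_percent": 97}) would apply the fix without waking anybody up. Anything not covered by a known pair still needs an engineer with problem-solving skills.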

In this episode, I've talked about the value of Checklists.

They can help us to avoid common mistakes. They have proven value in getting things right in complex, risky environments. They are incredibly powerful in ensuring that we learn from history and follow the well-trodden path.

They are, however, not the right approach for the unknown, for the untrodden path.

If anything, in those situations, they are likely to give us a false sense of readiness as we over-rely on troubleshooting guides prepared months or years in advance of any problems. We can quickly find ourselves without options for anything outside of those predicted "what-if" scenarios.

Rather, we should ensure that we have the relevant problem-solving skills to handle the unknown, the untrodden path.

And even remove those "what-if" scenarios or automate any remedial actions.

Thank you for taking the time to listen to this podcast. I look forward to speaking to you again next week.