The Reliably SRE Playbook

1. What is SRE?

As software architectures become increasingly complex and distributed, systems can break in new and unanticipated ways. Thankfully, SRE is here to help. Site Reliability Engineering (SRE) brings a set of sustainable practices and protocols that can be used to improve and extend DevOps workflows, with the end goal of improving reliability and keeping end users happy.

SRE is a more proactive approach from developers, utilizing automation to help move delivery teams away from constant firefighting, and towards the things that matter – like building more responsive, more secure, and more reliable systems.

With the changing landscape of enterprise systems, the balancing act between pure acceleration and product reliability is crucial for all business-essential software. To explain a bit more about how that’s possible, let’s talk for a second about Formula 1. Yep, Formula 1. The race cars.

No more engine fires

Think of Formula 1, and the first thing that comes to mind is probably speed. Every team wants their car to be the quickest, and the hordes of fans show up to see the drivers battle for that accolade.

History shows that the most successful Formula 1 teams are those with the most reliable cars, not just the quickest. Think of it like this: every time a driver has to pull into the pits for mechanical attention, the team loses time.

What we’re trying to say is that you can’t be fast unless you’re reliable. If Formula 1 teams are dedicating engineers to that very task, why aren’t we doing the same in the world of IT?

Understanding this new approach to reliability engineering isn’t just about race cars. Nor is it just a case of understanding the processes that make up SRE. We’ll get to that. Instead, let’s begin by looking at the movement’s wider goals, and what they’ll help you and your delivery team to achieve.

Because SRE isn’t just a case of practical learning. It’s also a cultural shift in the way engineers practice their craft. We want developers and engineers to look at reliability in a new way – focusing always on the bigger picture, and the goals we all share.

“If you want to build a ship, don’t drum up the people to gather wood, divide the work, and give orders. Teach them instead to yearn for the vast and endless sea.”
Antoine de Saint-Exupéry

Understanding reliability

First things first: SRE is all about reliability. This is what defines SRE as an evolution and logical extension of modern DevOps: recognizing that with increasing automation from the engineers, operational decisions often fall to the developer, and they want to be sure they are making the right choices. But what does that mean in practical terms?

Well, the foremost practical measure of reliability is a simple one: availability. Because remember: reliability is judged from the outside, not the inside. Before anything else, your system has to be there for its users – live and accessible for every second of every day, and working at a level that delights your users.

Other important aspects of reliability include durability, security, and response latency, among various other metrics. It’s the job of site reliability engineers to constantly identify risks to their systems across these areas, to assess just how much risk is acceptable, and then to put frameworks in place to ensure their systems comply.

To be clear: despite the abundance of metrics and powerful monitoring tech, it’s ultimately the users who will judge the reliability of our systems. Sometimes this is measurable with data. Sometimes it’s not so simple. Which leads us onto the human side of site reliability engineering.

Defining our human-centric approach

So we know there’s a practical side to making a system reliable. But today’s definition of reliability goes further than that. Now it’s a question of creating something users can rely on in a deeper sense. It’s about aligning our principles with those of our users, and showing them that we share their goals: ethically and aspirationally, as well as practically. In fact, this idea of shared goals runs throughout SRE thinking. At Reliably, we like to think of SRE as a more human-centric approach to reliability.

In other words, the end goal of SRE is about more than just chasing good uptime numbers. Yes, keeping your systems up is still a top priority. But this approach will also have wider implications for the people building and using your systems, and that goal should be within our sights at all times.

This “bigger picture” culture – of wanting the best for all human parties involved – even filters down to the shape of the team. SRE teams are built on openness, ensuring access to information from all angles, and avoiding blame whenever things go wrong. Accepting failure is – to a degree – a part of the job. Just so long as the next step is learning from it.

Eliminating toil

At the centre of SRE’s drive for a better cognitive experience for both system engineers and their end users is the elimination, or reduction, of toil. In short, a reliable system frees developers from toil, and allows them to focus on meaningful and valuable work – innovating and improving on what went before.

But what exactly is toil? ‘Toil’ is an interesting term in SRE, one that has many definitions, from many lived experiences. Toil is pretty much anything that gets in the way of a developer doing the work they love, and need, to do. Unexpected refactoring and system outages are a few common examples.

In plain terms, toil can be seen as the cognitive load placed on engineers to build, monitor, understand and communicate their (increasingly complex) systems. SRE practices are meant to help smooth out those processes, but it can be challenging to get them in place when you are just starting out.

With Reliably in hand, engineers have a real superpower in their command line, offering automation, monitoring, and support that they can put to work in real time. No seminars, no additional training needed: just learn and apply the best parts of SRE for the deployment processes you’ve already got in place.

Bringing technical and tactical support together

We’ve already mentioned how a new approach to reliability can even change the shape of your team. And no wonder: things go more smoothly when we all work together. That’s why SRE seeks to open up the workplace in a new way.

This way, observability for technical ethics and operations is becoming a reality, with accurate systems-level reporting to support it. And this is exactly what we need in modern development spaces.

With structure and shared understanding, we remove the pressure on any individual contributor, whether developer or product lead, to act as a “whistle-blower”. Under SRE, both parties have access to the same information and a shared understanding. Regular, automated, systems-level reporting thus allows for error monitoring in operations. And this in turn allows for early detection of system and operational drift – spotting problems before they become failures. This is what Google calls a “blameless postmortem culture” (Loo10, All12).

Understanding why SRE is needed

The goals of SRE are so patently positive, you might assume that they’re already built into the mindset of development teams everywhere. It’s obvious, right? Well, apparently not.

The Uptime Institute reports a trend across their 2018, 2019, and 2020 surveys:

“it is clear that outages occur with disturbing frequency, that bigger outages are becoming more damaging and expensive, and that what has been gained in improved processes and engineering has been partially offset by the challenges of maintaining ever more complex systems. Avoiding downtime remains a top technical and management challenge for all owners and operators.”

The surveys show that systems still aren’t very reliable. Uptime numbers aren’t where they could and likely should be, with service agreements commonly being broken and targets missed. This is costing companies vast sums of money – a clear sign that things aren’t working.

So we see that the SRE movement is emerging from a very real need. It’s important to state that this isn’t necessarily the fault of developers. With systems growing ever more complex, and the demands on DevOps teams constantly expanding, the situation is understandable: many development teams have become overstretched, and therefore under-utilized.

The emergence of SRE is a statement that the DevOps and development community can do better. We can put our skills to better use, and find smarter solutions to the problems of the day. This SRE Playbook is designed to give you the fundamental know-how to get started on that path: of empowering your teams to do things differently, and to be a part of the reliability revolution.

It’s also why we are building Reliably. It’s not just an SRE-automation product (even though that’s our guarantee). Reliably defines a way of thinking about software design and deployment that reduces the cognitive load on the developer, while shifting the narrative focus to what really matters: the user experience.

2. How does SRE help?

What does this practice actually mean, in real terms, for its users? Here are a handful of insights into the joys of combining SRE and DevOps over time.

Better observability

Proactive reliability engineering is a nice idea, but how do we get there? Well, it begins with knowing where to look. And it helps if our systems can give us a clue. Enter: observability.

For delivery teams to work efficiently, they need to be able to inspect the state of their systems comprehensively. In practice, this means a combination of monitoring, logging, and alerting – all of which are made possible by an abundance of tools available in the market, many of which are open source.

The ability for devs to use multiple tools together (Prometheus and Grafana, for example, are a popular duo) means we can visualize data in new ways, providing different perspectives.

The more we see, the more we know.

Reliable systems, happy developers

Developers love developing, not firefighting. And the ability to build better, more powerful systems with less emotional stress is yet another easy sell. This, coupled with the drive for better user experiences, points toward the wider cultural benefit of “shifted left” engineering. It doesn’t just result in better products – it results in happier people. It’s all a cycle: what goes around comes around.

Less toil = happy devs

In section 1, we touched on the evils of toil, and SRE’s aim to rethink reliability in such a way as to keep this tedious work to a minimum. And sure enough – for anyone who dislikes toil, SRE is a real lifeline.

Better reliability engineering isn’t just of benefit to the so-called ‘front-line’ delivery teams. It’s also an easy sell for decision makers and the more fiscally-minded members of a DevOps team. Why? Well, site reliability engineers are in short supply.

Not only does the role require a very specific and expansive skill set, but the youthful nature of the discipline means there’s no abundance of experienced candidates. In other words, the practical work hours of a site reliability engineer are incredibly valuable to companies’ strategic minds (a fact reflected by the sizeable salaries on offer). So: toil isn’t just a waste of devs’ time; it’s also a terrible waste of money for those with an eye on the payroll.

Improved user journeys = happy customers

The upshot of all these behind-the-scenes improvements is ultimately about the end user. By freeing up developers to focus their skills where it matters, and using targeted service level indicators, objectives and error budgets to identify key areas in need of work, we’re able to make real advances in the customer journey.

Site reliability engineers listen closely to their community of users in order to ascertain which elements of the customer journey are the most critical. First: identifying pain points. Second: judging their relative importance. Third: doing reliability engineering in accordance with the findings from one and two.

DevSecOps: Life beyond reliability

If DevOps declared an end to the dreaded silo, then SRE was definitely listening. Because the benefits of practicing good SRE extend well beyond just reliability. In addition to providing a solid foundation for improvements to speed and usability, good SRE also has huge implications for DevSecOps.

DevSecOps effectively applies the cultural approach of SRE to security. Instead of treating security as an afterthought – something to be clumsily applied at the tail-end of the development and deployment process – DevSecOps seeks to take a more holistic approach. In practice, this sees security considerations taking place throughout the entire lifecycle. As with other parts of SRE, the responsibility for security is then shared equally throughout the team, rather than existing in a silo.

It’s simple, but it’s clever.

3. How to do SRE the right way

Crucially, Reliably allows you to integrate all your SRE infrastructure with your existing systems and processes, meaning you’re able to build on your past successes.

If you’re ready to dive into Reliably and start improving the reliability of your systems, head over to our Docs for some hands-on learning.

Otherwise, here’s a primer to get you in the mood...

The path to SRE

As a cultural approach, adopting SRE is not a straightforward yes or no decision. No system is binary: either SRE, or not. Instead, it’s about gradually changing your approach to reliability engineering to align with new principles and techniques.

The first changes you’ll want to make in order to morph your DevOps team into an SRE super group include placing a greater emphasis on observability.

Your criteria for success will become different. Not always going straight for the easy fix, but instead making decisions based on how they will impact the lives of developers and end users in the long term. With these new metrics comes a new understanding of success.

How to integrate SRE with your existing system

Few SRE tools can integrate as seamlessly with your existing DevOps set-up as Reliably. You can get up and running wherever you want – inside GitHub, GitLab, or even locally using the Reliably CLI. The automated process scans your code to look for reliability issues, then gives you a clear report on what you can do to improve on your hard work.

Why have we made this possible? Because this is about expanding and improving your systems – not throwing them in the trash and starting from scratch.

How to measure SRE

Proactivity is part of the fabric of good reliability engineering. So it isn’t just a case of implementing practices and then reaping the rewards. Systems change. So you need to be on the lookout for ways to improve and avert disaster as your code evolves.

We’ve already talked about the importance of observability. The key to greater observability lies in targeted measurement: tracking the impact of changes you make in your code, and keeping close tabs on how systems perform against your primary criteria. This is the world of SRE metrics.

And the wonderful thing about site reliability engineering is that you’re never alone. There’s an entire community of engineers and Reliably users, all benefiting from one another’s skills and experience. This knowledge base informs the Reliably product every day, but it also exists within a community of users, helping one another find new ways to solve problems and get more from reliability.

Remember: this is new. We’re all learning.

4. Stuff you should know

Site reliability engineers are operating on new land. And with new territory comes new terminology. Here, we cover all the basics to help you wrap your head around the brave new world of SRE and all that comes with it.

SLI

Service level indicators can help check the health of your system in its current state. These are defined measurements, which help engineers prioritize different aspects of reliability.

These might include:

  • Availability
  • Error rate
  • Request latency
  • System throughput

SLIs serve as a useful link between engineers and users, ensuring the work that’s done behind the scenes is hitting the most relevant pain points and, ultimately, having the biggest impact.
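
To make the indicators listed above a little more concrete, here is a minimal sketch of how they might be calculated from a batch of request records. Everything in it is an assumption made for illustration: the Request record shape, the decision to count HTTP 5xx responses as failures, and the function names. None of it is taken from Reliably or any particular monitoring tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request, in milliseconds


def availability(requests):
    """Fraction of requests that did NOT fail with a server error (5xx)."""
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def error_rate(requests):
    """Fraction of requests that failed with a server error (5xx)."""
    if not requests:
        raise ValueError("need at least one request")
    bad = sum(1 for r in requests if r.status >= 500)
    return bad / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of request latency, in milliseconds."""
    if not requests:
        raise ValueError("need at least one request")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def throughput(requests, window_seconds):
    """Requests served per second over the observation window."""
    return len(requests) / window_seconds


if __name__ == "__main__":
    # Ten made-up requests observed over a ten-second window.
    sample = [Request(200, 100.0)] * 8 + [Request(500, 900.0), Request(200, 300.0)]
    print("availability:", availability(sample))
    print("error rate:", error_rate(sample))
    print("p95 latency (ms):", latency_percentile(sample, 95))
    print("throughput (req/s):", throughput(sample, 10))
```

In practice these numbers usually come from a monitoring system rather than hand-rolled code, but the arithmetic behind each indicator is about this simple.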

SLO

Service level objectives are very straightforward. These are targets for engineers, set to define the level of reliability that a user would be happy experiencing. They are typically ambitious, but realistic, and the real magic lies in setting the bar at the right height between those two ideals. In this task, engineers have one cheat code up their sleeves: the error budget. This budget is the gap between perfect reliability and the objective, so it effectively determines the maximum allowable threshold for things to go wrong. A 99.9% availability objective, for example, leaves an error budget of 0.1%.

Stay within the error budget, and changes to the product can be launched with freedom and total abandon. If the budget is exceeded, the dev team needs to reel it in and play safe – which is obviously sub-optimal for building powerful systems.
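
The error budget check described here can be sketched in a few lines. This is an illustration under stated assumptions, not how Reliably or any specific tool implements it: the budget is counted in failed requests over a fixed window, and the names BudgetStatus and error_budget_status are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float  # failures the SLO tolerates in this window
    remaining: float         # budget left; negative means overspent
    exceeded: bool           # True once failures go past the budget


def error_budget_status(slo_target, total_requests, failed_requests):
    """Compare observed failures against the budget implied by an SLO.

    slo_target is a fraction, for example 0.999 for a 99.9% objective.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    allowed = (1 - slo_target) * total_requests
    remaining = allowed - failed_requests
    return BudgetStatus(allowed, remaining, exceeded=remaining < 0)


if __name__ == "__main__":
    # Made-up numbers: a 99.9% objective over one million requests.
    status = error_budget_status(0.999, 1_000_000, 400)
    if status.exceeded:
        print("Budget exceeded: slow down and prioritize reliability work")
    else:
        print("Budget available: changes can keep shipping")
```

A team might run a check like this before each release, and treat an exceeded budget as the signal to play safe until reliability recovers.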

SLA

Think of service level agreements as contracts with the client. They represent the minimum that can be provided: below this level the customer will be more than unhappy, and may even be litigious. These are the levels of reliability that are agreed upon between a delivery team and their client. As such, these commitments need to be entirely realistic.
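
One way to get a feel for what an availability commitment means is to convert it into permitted downtime. The sketch below does that arithmetic. The 30-day period and the two targets (a 99.5% SLA alongside a stricter 99.9% internal SLO) are illustrative assumptions, as is the common practice of keeping the internal objective tighter than the contractual minimum.

```python
from datetime import timedelta


def allowed_downtime(target, period=timedelta(days=30)):
    """Downtime that an availability target permits over a period.

    target is a fraction, for example 0.999 for 99.9% availability.
    """
    if not 0 <= target <= 1:
        raise ValueError("target must be a fraction between 0 and 1")
    return period * (1 - target)


# Illustrative targets only, not recommendations.
SLA_TARGET = 0.995  # what the contract promises the client
SLO_TARGET = 0.999  # the stricter bar the team holds itself to

if __name__ == "__main__":
    print("SLA allows:", allowed_downtime(SLA_TARGET))
    print("SLO allows:", allowed_downtime(SLO_TARGET))
```

Over 30 days, a 99.9% target allows roughly 43 minutes of downtime and a 99.5% target allows about 3.6 hours. The gap between the two is the safety margin a team has before a missed objective turns into a broken agreement.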
