State of SRE

A Brave New World for Reliability

Let’s be honest about reliability: we all know it’s never been the main attraction. Sure, it’s always been somewhere on the cast list. But never at the top, never the star of the show. For years now, other players – such as speed and security – have tended to hog the limelight. But now, something is changing.

Suddenly, in 2021, reliability is taking center stage.

Reliability – specifically SRE (site reliability engineering) – has gained so much ground that it has shifted from being merely something you do to being something you are. Site reliability engineers are everywhere! Or at least, vacant SRE positions are. Site reliability engineers are in great demand as businesses everywhere begin to recognize the cost of failing to invest sufficiently in reliability.

We at Reliably thought it high time to sit back and take stock of where we are as an industry.

And it’s not just a case of boarding the SRE hype train, and telling you all how exciting and popular the space has become. (Don’t worry – you’ll see plenty of that across our channels in the coming months!)

Because we believe businesses are realizing that reliability isn’t just about, well, being reliable. It’s also clear that you can’t be fast, or offer a first-class user experience, unless your systems are reliable. For want of a better word, the recent approach to DevOps is starting to look positively holistic.

The Covid-19 pandemic has had a considerable impact on SRE practices. The global situation has in most cases exacerbated the pressures facing software engineers – a result of increased expectation, and the new challenges that come with working remotely.

So this is intended as a state of the industry address: our take on the SRE landscape, such as it is in 2021. We’re here, and we’re proud. We’ve done great things. But what’s next? Where is SRE failing to reach its full potential? And what can be done to brighten its future?

Reliably’s view of SRE in 2021

Take #1: We’re too “Ops-heavy”

Because developers need to get back to developing.

In the true spirit of DevOps, the work of a site reliability engineer was originally intended (by Google, where the discipline began, at any rate) to be split roughly 50/50 between development and operations. And it did start out like that. But SRE has since become considerably less well balanced.

Reports show that developers everywhere are becoming increasingly preoccupied with operations work. Some respondents estimate that Ops work makes up as much as 75% of their day-to-day. At the heart of that is firefighting – running around spotting and correcting problems after they have already hurt the user experience. In short, the reactive approach is inefficient, and takes developers away from the positive elements of their work, where they’re best placed to use their (ample) talents.

The Reliably take on this is simple: For devs’ sake – let’s shift left! But more on that later.

This trend towards imbalance has been apparent for some years, but it seems to have grown worse since Covid-19. The increase in home working has meant that engineers are finding their roles redefined to a significant extent – taking on new (often Ops-related) tasks, with an ever-increasing workload.

Why is this happening? Well, it’s partly a result of downsizing. Regrettably, the economic impact of Covid-19 has resulted in job cuts across most sectors. Despite the industry’s general adaptability and aptitude for remote working, IT has not proven itself immune. As teams shrink, it’s somewhat inevitable that individuals’ roles are changing to pick up the slack left by departing team members.

Take #2: Workloads are getting out of control

Because devs are over-worked and facing burnout. They need help.

It isn’t just the nature of work that’s changing as a result of downsizing. It’s also leading to a sizeable increase in workload. Unsurprisingly, for SREs, this means even more pressure and even more responsibility.

This serves only to compound the frustrations of engineers who were already being squeezed in a pre-pandemic setting. More than ever, engineers are now being kept away from the positive innovation that sees their best talents put to good and proper use. As well as spending more time on operational tasks, we’re also losing more hours to toil.

Aside from major restructuring of human resources, and the re-allocation of certain tasks, the obvious solution to this growing workload is to automate. And that’s the goal of Reliably – to take the vast knowledge of the SRE community, and apply it automatically to your development set-up.

Take #3: Observability needs a bump

Because SRE demands that we view our system success differently.

Recent surveys show that site reliability engineers aren’t giving observability the love we think it deserves. While 93% of respondents cited monitoring as a central SRE tool, only 53% listed observability. Furthermore, only a tiny proportion of respondents gave weight to third-party observability (just 11% said their automated workflow extended this far). We see this as a problem.

Crucially, what this tells us is that too many engineers are looking at their systems from the inside out. To give users the best possible experience, we believe engineers should be looking from the outside in.

By failing to invest in proper observability, engineers are making their lives more difficult. While monitoring can be sufficient to flag up problems in the system, it’s rarely enough to show you why these problems are occurring. Observability provides a bigger picture – not only showing engineers what’s going wrong, but also giving them the key information they need to fix it.
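To make that distinction a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the event fields (status, region, version), the sample data, and both function names are hypothetical, and none of it represents Reliably or any particular monitoring product. The first function behaves like a monitor and only says that something is wrong; the second slices the same events by a chosen field, which is the kind of question that starts to explain why.

```python
from collections import Counter

# Hypothetical request events; the field names and values are illustrative only.
EVENTS = [
    {"status": 200, "region": "eu", "version": "1.4.2"},
    {"status": 500, "region": "us", "version": "1.5.0"},
    {"status": 200, "region": "us", "version": "1.4.2"},
    {"status": 500, "region": "eu", "version": "1.5.0"},
    {"status": 200, "region": "eu", "version": "1.4.2"},
    {"status": 500, "region": "us", "version": "1.5.0"},
]


def monitor_error_rate(events, threshold=0.25):
    """Monitoring-style check: is the share of failed requests above a threshold?"""
    if not events:
        return False, 0.0
    errors = sum(1 for e in events if e["status"] >= 500)
    rate = errors / len(events)
    return rate > threshold, rate


def explain_errors(events, dimension):
    """Observability-style query: count failed requests per value of one field."""
    return Counter(e[dimension] for e in events if e["status"] >= 500)


alerting, rate = monitor_error_rate(EVENTS)
by_version = explain_errors(EVENTS, "version")
by_region = explain_errors(EVENTS, "region")

print(f"alert={alerting} error_rate={rate:.2f}")
print("failures by version:", dict(by_version))
print("failures by region:", dict(by_region))
```

In this made-up data the monitor simply fires, while the breakdown shows every failure coming from a single version and spread across both regions, which points towards a bad release rather than a regional outage. Real systems carry far richer context, but the difference in the question being asked is the same.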

Take #4: Wellness should be taking priority

Because better human experience is always our end goal.

It’s hardly surprising that the pandemic has highlighted an industry-wide issue with well-being. Working from home may have initially seemed convenient for many of us, but it’s now beginning to take its toll. It’s becoming difficult to maintain a healthy work-life balance as the workplace encroaches upon our private space.

Coupled with an increased and changing workload, the new SRE landscape is leading to a swelling possibility of burnout among developers. Put simply, this is a hard time for devs. And with everything going on around us in the world today, further pressures in the workplace are the last thing they need.

More than ever, this raises the question of cognitive load. We should be mindful of the burden we place on our DevOps teams, ensuring that their workload is proportionate to capacity in both volume and complexity.

The future of SRE

Onto the future, then. And it’s good news!

Sure, the state of SRE may make for difficult reading at first glance – particularly in the midst of a pandemic – but we believe the future is bright. We’re a community of problem solvers, after all!

At Reliably, we actually see the current SRE landscape as a huge opportunity: in short, a chance to shift left and change the course of reliability for the better. Here’s how our vision looks:

We want less toil. Through streamlined workflows and smart automation, we want to free up developers to show off their true talent. Let great minds once again produce great things!

We want more automation. Automating more of the reliability process helps ease the workload on developers, making them happier and more productive.

And, last but not least, we want a greater focus on well-being. While we believe that this starts with a changing approach from employers – openness to more flexible working patterns, and various measures of employee support – we also think software can play a role. In fact, that’s part of the reason we built Reliably in the first place.

However you look at it, all the details – cognitive load, reducing toil, managing wellness, better user experience – point to one wider movement, and that’s towards a more human way of doing things.

The shift left in reliability is part of this human-centric approach to development. New metrics such as team cognitive load are appearing, and we’re looking for the first time to architect more manageable systems that reflect the shape of our workforce, as well as the needs of our users.

Nowhere is this “user-awareness” more evident than in Google’s Customer Reliability Engineering. As the name suggests, this is the application of SRE principles to customer experience. It’s based on the idea that the old-fashioned call center approach just doesn’t cut it anymore. We need to really be there for our customers.

Reliably has an important role to play. Developers are looking for help in achieving a better work-life balance, and any tools that can help them in this respect will prove invaluable in the months and years to come.