As systems become more complex, reliability is becoming an increasingly important requirement. Organizations are quickly realizing that when reliability is a critical part of their service, customers are less likely to drop that service when they need to cut costs.
As a result, the field of site reliability engineering (SRE) has grown rapidly over the past few years. In January 2022, LinkedIn ranked SRE as 21st in a list of jobs with the highest global demand across the past five years.
In this article, we’ll outline the benefits of making reliability a critical service of an organization from an SRE perspective, and how reliability and security are related to one another.
The value of reliability as a service
Reliability is the probability that a system will meet certain performance standards and yield the correct output for a specific duration.
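To make this definition concrete, here is a minimal sketch in Python. It assumes a constant failure rate (the exponential model), which is a simplification introduced here and not something the definition itself requires. The function name and the MTBF (mean time between failures) parameter are illustrative.

```python
import math


def reliability(duration_hours: float, mtbf_hours: float) -> float:
    """Probability of running without failure for duration_hours.

    Assumption (not from the article): failures occur at a constant
    rate, so R(t) = exp(-t / MTBF), where MTBF is the mean time
    between failures.
    """
    if duration_hours < 0:
        raise ValueError("duration_hours must be non-negative")
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return math.exp(-duration_hours / mtbf_hours)


# A system with a hypothetical MTBF of 1,000 hours, observed for one day.
print(f"{reliability(24, 1000):.3f}")
```

Under this model, the longer the duration you ask about, the lower the probability of getting through it without a failure, which is why a reliability figure is only meaningful alongside its time window.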
Over the past few years, the value of reliability has increased drastically for a significant number of organizations.
Even a brief reduction in reliability can result in huge monetary losses. For instance, Amazon missed out on an estimated $34 million in sales when its website went down for just 59 minutes back in June 2021.
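As a rough back-of-the-envelope sketch, the figures quoted above can be turned into a per-minute cost. This assumes the loss was spread evenly across the outage, which is a simplification; the function name is illustrative.

```python
def loss_per_minute(total_loss: float, outage_minutes: float) -> float:
    """Average loss per minute, assuming losses accrue evenly over the outage."""
    if outage_minutes <= 0:
        raise ValueError("outage_minutes must be positive")
    return total_loss / outage_minutes


# Figures quoted in the text: an estimated $34 million over 59 minutes.
rate = loss_per_minute(34_000_000, 59)
print(f"~${rate:,.0f} per minute")  # ~$576,271 per minute
```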
To reduce the chances of such incidents, large businesses like Amazon hire site reliability engineers (SREs) who write software to automate the oversight of large software systems. In the long run, this is a more sustainable strategy than manual intervention.
The importance of both reliability and security
System security and reliability essentially have the same job: to make the system available for users.
Security ensures that the system is doing the job it is required to do, while reliability ensures that the system is doing its job correctly. They are both dependent on one another.
While reliability focuses on the internal aspects of a system, security focuses more on the external aspects of the system. Security and reliability each have different risks, which means that you have different things to consider.
While reliability risks are typically non-malicious, security risks are caused by people who are actively trying to exploit the system.
According to James Whittaker’s Microsoft Security blog post published back in 2007, security and reliability “are different aspects of the general problem of protecting our customers”.
Designing a system for security draws on many of the same ideas and techniques as designing a system for reliability, because the two are directly linked. While many people try to choose one or the other, the reality is that they work together.
Impact of cost-cutting on reliability
Many organizations are tempted to cut costs on reliability, especially in times of economic downturn. It seems like an easy way to save money, and the repercussions aren’t immediately apparent.
However, while cutting costs on reliability is likely to save money in the short term, it typically leads to an increase in breakdowns that will cost even more money—and, often even more importantly, time—to fix.
Cutting costs on reliability can be especially expensive if the organization fails to meet the Service Level Agreements (SLAs) that it has with users.
For instance, if the monthly uptime of Google’s Cloud DNS drops below 95%, Google has to credit affected users with 50% of their monthly bill. As you can imagine, this adds up quickly.
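The sketch below shows how such a credit could be calculated. It uses the 95% threshold and 50% credit from the example above as default values and models only that single tier. Real SLAs usually define several tiers and their own way of measuring downtime, so treat the function names and the 30-day month as assumptions, and check the provider's published terms before relying on any figure.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes (assumed billing month)


def monthly_uptime_percent(
    downtime_minutes: float,
    total_minutes: float = MINUTES_PER_30_DAY_MONTH,
) -> float:
    """Share of the month, in percent, during which the service was up."""
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime must be between 0 and total_minutes")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def sla_credit(
    monthly_bill: float,
    uptime_percent: float,
    threshold_percent: float = 95.0,  # threshold from the example above
    credit_fraction: float = 0.5,     # 50% credit from the example above
) -> float:
    """Credit owed under a single-tier SLA (a simplification)."""
    if uptime_percent < threshold_percent:
        return monthly_bill * credit_fraction
    return 0.0


# Hypothetical customer with a $1,000 bill and 40 hours of downtime.
uptime = monthly_uptime_percent(downtime_minutes=40 * 60)
print(f"uptime {uptime:.2f}%, credit ${sla_credit(1000.0, uptime):.2f}")
```

Multiplied across every affected customer, even a single month below the threshold becomes a significant cost.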
However, the repercussions of cost-cutting on reliability aren’t just financial. Leaving it up to Operations teams to address reliability issues can quickly lead to employees becoming overworked, which can cause feelings of disillusionment and burnout. This can lead to an increased turnover rate amongst employees.
Cost-cutting on reliability can also have huge repercussions on an organization’s reputation. It doesn’t take long to destroy a reputation if something goes wrong — even if that reputation has taken years to build. The market is competitive, and customers have many other options available to them.
What are the benefits of having reliability as a critical service?
There are many benefits of having reliability as a critical service within your organization. It will have positive repercussions on your maintenance staff, your engineers, and your customers.
Making reliability a key part of your organization shows your maintenance staff that you care about efficiency. With fewer reactive maintenance tasks to handle, they have more time to develop their skills and grow professionally within the organization. This is also likely to boost their morale, reduce feelings of burnout, and decrease your organization’s turnover rate as a result.
Making reliability a critical service means that developers and engineers can devote more time to focusing on tasks that would typically be given to operations teams. This increases clarity and efficiency throughout the organization, which makes reliability a much more achievable goal.
Last but not least, having reliability as a critical service helps to optimize the customer experience. Implementing SLAs, service level indicators (SLIs), and service level objectives (SLOs) helps engineers understand what thresholds they should be meeting. It also sets customer expectations and reassures customers that they will be adequately compensated if these expectations are not met.
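To illustrate how an indicator and an objective fit together, here is a minimal sketch of a request-based availability check. The class and function names, the 99.9% target, and the request counts are all hypothetical. The "error budget" is the share of allowed failures that has not yet been used up, a common SRE convention that the text above does not itself define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float                     # measured fraction of good requests
    slo_target: float              # objective, e.g. 0.999
    error_budget_remaining: float  # 1.0 = untouched, below 0 = overspent
    slo_met: bool


def evaluate_slo(good_requests: int, total_requests: int, slo_target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_requests / total_requests
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    remaining = 1 - actual_failures / allowed_failures
    return SloReport(sli, slo_target, remaining, sli >= slo_target)


# Hypothetical month: 1,000,000 requests, 400 of them failed, 99.9% target.
report = evaluate_slo(good_requests=999_600, total_requests=1_000_000, slo_target=0.999)
print(report)
```

An SLA would then attach consequences, such as service credits, to missing a threshold that is usually looser than the internal SLO.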
How can you make reliability a critical service of an organization?
Reliability is measured in terms of the frequency and the impact of failures. Organizations need to decide what frequency of failures they can bear without disrupting the performance of their services too much.
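One common way to quantify frequency and impact is with mean time between failures (MTBF) and mean time to repair (MTTR). The sketch below derives both from a list of outage durations. The metric choice, the names, and the sample data are assumptions for illustration, not something the article prescribes.

```python
from typing import NamedTuple, Sequence


class FailureSummary(NamedTuple):
    failures: int        # how often things went wrong (frequency)
    mtbf_hours: float    # average uptime between failures
    mttr_hours: float    # average outage length (impact)
    availability: float  # fraction of the period the system was up


def summarize_failures(observed_hours: float, outage_hours: Sequence[float]) -> FailureSummary:
    """Summarize failure frequency and impact over an observation period."""
    if observed_hours <= 0:
        raise ValueError("observed_hours must be positive")
    if not outage_hours:
        raise ValueError("at least one outage is needed to compute MTBF and MTTR")
    downtime = sum(outage_hours)
    if downtime > observed_hours:
        raise ValueError("total downtime exceeds the observation period")

    uptime = observed_hours - downtime
    failures = len(outage_hours)
    return FailureSummary(
        failures=failures,
        mtbf_hours=uptime / failures,
        mttr_hours=downtime / failures,
        availability=uptime / observed_hours,
    )


# Hypothetical 30-day period (720 hours) with three outages.
print(summarize_failures(720, [1.0, 2.0, 3.0]))
```

An organization can then set its tolerance in these terms, for example a maximum number of failures per month or a ceiling on MTTR.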
For instance, mission-critical systems such as online banking, electric power systems, and air traffic control systems will require a significantly higher level of reliability than a typical organization because there is much more at stake if things go wrong.
There are many ways that you can improve system reliability within your organization and make it a critical service.
First and foremost, you should make sure that communication lines are open within the organization and that you are constantly asking for feedback from your team members to make sure that they are engineering for reliability. You should also be asking for regular feedback from your customers to make sure that you are meeting their expectations and understanding their pain points.
You can stress-test your systems with chaos engineering to understand where there are weaknesses in your systems and gaps in your team’s communication. You should also make sure that your team monitors and logs incidents so that it can avoid repeating them in the future.
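As a toy illustration of the idea behind chaos engineering, the sketch below deliberately injects failures into a function call and measures how often a simple retry policy still succeeds. Everything here is hypothetical and runs in-process; real chaos experiments target live infrastructure under controlled conditions and typically use dedicated tooling.

```python
import random
from typing import Callable


class InjectedFault(Exception):
    """Raised when the experiment deliberately fails a call."""


def with_fault_injection(func: Callable[[], str], failure_rate: float,
                         rng: random.Random) -> Callable[[], str]:
    """Wrap func so that it fails with probability failure_rate."""
    def wrapper() -> str:
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func()
    return wrapper


def call_with_retries(func: Callable[[], str], attempts: int) -> str:
    """Call func up to `attempts` times, re-raising the last injected fault."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except InjectedFault:
            if attempt == attempts - 1:
                raise
    raise AssertionError("unreachable")


def run_experiment(trials: int, failure_rate: float, attempts: int, seed: int = 0) -> float:
    """Return the fraction of trials that succeed despite injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_fault_injection(lambda: "ok", failure_rate, rng)
    successes = 0
    for _ in range(trials):
        try:
            call_with_retries(flaky, attempts)
            successes += 1
        except InjectedFault:
            pass
    return successes / trials


# Compare no retries against three attempts at a 30% injected failure rate.
print(run_experiment(trials=1000, failure_rate=0.3, attempts=1))
print(run_experiment(trials=1000, failure_rate=0.3, attempts=3))
```

Logging the outcome of each run gives the team a record of which failure modes the system tolerates and which ones need work.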
Finally, you should get into the habit of carrying out smaller, more frequent release cycles to reduce the chances of testing efforts being cut short, as this can significantly reduce reliability.
What does this mean for organizations?
Reliability depends largely on the talents of an organization’s SRE team, which means that many organizations are currently putting a lot of emphasis on growing these teams.
The 2021 Upskilling Report revealed that SRE adoption grew from 10% in 2019, to 15% in 2020, to 22% in 2021. In 2022, this figure is expected to double.
Over the next few years, we are likely to see this figure rise even further as more organizations shift their efforts to improve the reliability of their services.