What is site reliability engineering?
Site reliability engineering (SRE) takes on work that would typically be done by an operations team, but uses engineers with software experience to solve the problems involved.
The concept of SRE was created by Google in 2003 after a team of software engineers was asked to make Google’s sites more scalable, reliable, and efficient. They described SRE as ‘when you treat operations as if it’s a software problem’.
The practices developed by this initial team were so successful that other tech companies began to adopt them. It has now become standard practice for companies to hire SRE teams who use software as a tool to manage systems, solve problems, and automate operations tasks.
What does a site reliability engineer do?
Google states that the ultimate goal of site reliability engineers (SREs) is to ‘automate their way out of a job.’ To help achieve this, it suggests that SREs should spend a maximum of 50% of their time on operations and that they should monitor this to make sure it is not exceeded.
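One way to monitor the 50% limit is to log hours by category and check what share goes to operations work. The following is a minimal sketch in Python, not a description of how Google tracks this. The `TimeEntry` class, the `ops_fraction` and `exceeds_ops_cap` functions, and the sample hours are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

OPS_CAP = 0.50  # assumed to mirror the 50% ceiling described above


@dataclass
class TimeEntry:
    hours: float
    is_ops: bool  # True for operations work (tickets, on-call, manual tasks)


def ops_fraction(entries):
    """Return the share of logged hours spent on operations work."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0  # nothing logged, so nothing counts against the cap
    ops = sum(e.hours for e in entries if e.is_ops)
    return ops / total


def exceeds_ops_cap(entries, cap=OPS_CAP):
    """Return True when operations work takes more than the allowed share."""
    return ops_fraction(entries) > cap


# Hypothetical week: 20 hours of operations work out of 40 logged hours.
week = [TimeEntry(12, True), TimeEntry(8, True), TimeEntry(20, False)]
print(ops_fraction(week))     # 0.5
print(exceeds_ops_cap(week))  # False
```

In this sketch a week that sits exactly at 50% does not trip the check, because the limit is treated as a maximum rather than a target.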
SREs work with both development and operations. They essentially do work that has typically been done by an operations team – except they use their software expertise to replace human labor with automation.
The SRE pyramid, also known as the ‘Dickerson pyramid,’ provides a set of principles that an organization can use to define and improve reliability and to promote engineering excellence.
Core SRE skills
Some of the core skills required by SREs include:
- Extensive knowledge of version control
- Good business analysis skills
- Good team working skills
- Good written and verbal communication skills
- A good understanding of how DevOps works and of its best practices
- Fluency in technical language so that they can successfully pitch ideas to project stakeholders.
Do SREs write code?
The role of SREs is a challenging one that requires a strong interest in both programming and automation. SREs should be able to understand code, and they should also be good at writing code from scratch.
In addition, the role requires the following skills:
- Gathering project requirements from stakeholders
- Analyzing potential risks and identifying suitable countermeasures
- Calculating the cost of potential outages and implementing suitable contingency measures
- Analyzing the performance of systems in production by monitoring them.
What’s the difference between a site reliability engineer and a software engineer?
The key difference between SREs (site reliability engineers) and SWEs (software engineers) is that the primary goal of SREs is to maintain the reliability of the software, while the primary goal of SWEs is to design and create the software.
While SWEs often have a lot of different variables to take into consideration – such as the time taken to write the software, the cost of deployment, and the ease of updating the software – SREs are focused on improving the efficiency of incident resolution.
How much do site reliability engineers earn?
SREs are typically found at high-performing tech companies that are willing to pay high salaries to avoid multimillion-dollar losses, so the job market is competitive.
According to Payscale, SREs can expect to earn between $77,000 and $158,000, depending on experience. The average salary for SREs is $118,552. This is about 33.9% higher than the average salary for SWEs, which is $88,540.
What qualifications do SREs need?
SREs will typically need to have experience within the fields of computer science or software engineering, as well as experience with programming. A degree in computer science or another technical science is often preferred but is not always a requirement.
For more information on the certifications needed for specific SRE roles, take a look at some of the job postings on Google Careers. Google also has many resources suitable for SREs available on its website.
Building a culture of reliability
Building a culture of reliability means creating an organization that works towards shared goals focused on the availability of services, processes, and people.
Since reliability is customer-centric, it’s important for organizations to create shared goals and objectives across teams that focus on delivering great services to their customers.
Another key cultural shift required in many teams is the idea that people should be encouraged to learn from failures and to shift responsibility left, so that reliability is addressed earlier in the development process.
Who is responsible for service reliability?
Ultimately, service reliability within an organization should be a joint effort between the operations team and the engineering team.
When it comes to improving service reliability, getting everyone to accept accountability and ownership is easier said than done. Everyone within the organization should understand how they, as individuals, can affect reliability, and leaders should educate everyone on the importance of reliability and of a shift-left culture.
Top-down leadership is required to make this shift. It must consistently be made clear that reliability is a priority for the organization.
Should you consider hiring SREs?
One hour of downtime on Prime Day back in 2018 cost Amazon an estimated $100 million in lost sales. While the impact of an hour of downtime might not be quite so drastic for your business, it’s important to calculate the cost of your outages.
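As a starting point for that calculation, the sketch below estimates the cost of an outage as lost revenue plus any other direct costs. It is a deliberately simplified model: the `outage_cost` function, its parameters, and the sample figures are hypothetical. A real estimate would also need to account for things this sketch ignores, such as contractual penalties, staff time, and reputational damage.

```python
def outage_cost(revenue_per_hour, downtime_hours, extra_costs=0.0):
    """Estimate outage cost as lost revenue plus other direct costs.

    Assumes revenue is lost at a constant hourly rate for the whole
    outage, which is a simplification.
    """
    if revenue_per_hour < 0 or downtime_hours < 0 or extra_costs < 0:
        raise ValueError("inputs must be non-negative")
    return revenue_per_hour * downtime_hours + extra_costs


# Hypothetical business: $20,000 of revenue per hour, a 1.5-hour outage,
# and $5,000 of other direct costs (for example, customer credits).
print(outage_cost(20_000, 1.5, extra_costs=5_000))  # 35000.0
```

Running the same calculation for a few plausible outage lengths gives a rough range to weigh against the cost of investing in reliability.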
If your company’s infrastructure is constantly expanding, and you want to support existing products and services and shield yourself from potential outages while continuing to release new features, then it might be time to consider hiring SREs.