Eliminating Toil In SRE
What is toil in SRE?
Toil is a term coined by Google which describes the repetitive and tedious tasks associated with running a production service. Toil tends to be manual and devoid of any long-term value.
Toil is not just ‘work I do not like to do’. It is the routine operational work an engineer performs by hand to keep a production system running, and the time spent on it is toil time. The volume of this work grows as your service grows.
Site Reliability Engineers (SRE) should spend less time on toil. When SRE teams devote time to toil, they have less time to innovate, improve services and meet business requirements.
Not all manual tasks are considered toil.
Sending emails, placing calls, and creating expense reports are examples of overhead. Overhead does not affect production directly, and it is part of any office job. Most people in an organization, not just SREs, must complete tasks such as reports and budgeting.
Different Attributes Of Toil
Tasks described as toil are easy to execute but provide little value. Toil has the following attributes:
Toil is manual
Toil includes tasks such as clicking a button, running a script, or manually executing the commands in a script. Running a script reduces the amount of toil but does not eliminate it.
The time a human spends running that script is still toil time.
Toil is repetitive
A task you perform for the first or second time is not toil. It becomes toil when you have to repeat it over and over.
Toil is automatable
If a machine could perform the task as well as a human, the task is seen as toil. Hence, you should automate such tasks.
Not every task is worth automating, though. It is better to spend less than an hour fixing something that occurs only once or twice a year than to spend ten to fifteen hours automating it.
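The trade-off above can be checked with simple arithmetic. The following is a minimal sketch, not a standard formula: the function name automation_break_even and its parameters are made up for illustration, and it assumes the automation needs no maintenance once written.

```python
def automation_break_even(manual_hours_per_run: float,
                          runs_per_year: float,
                          automation_hours: float) -> float:
    """Return how many years it takes for automation to pay for itself.

    Assumption: the automation costs nothing to maintain after it is built.
    """
    yearly_manual_hours = manual_hours_per_run * runs_per_year
    if yearly_manual_hours <= 0:
        raise ValueError("the task must cost some manual time per year")
    return automation_hours / yearly_manual_hours


# Figures taken from the text: about an hour per fix, twice a year,
# and ten hours to automate.
years = automation_break_even(manual_hours_per_run=1.0,
                              runs_per_year=2,
                              automation_hours=10.0)
print(f"Break-even after {years:.1f} years")  # Break-even after 5.0 years
```

With these numbers the automation only pays off after five years, which is why a rare, quick fix is often better left manual. A task that runs every week would cross the break-even point within a few months.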
Toil is devoid of long-term value
Tasks that do not lead to the improvement of a production service are toil. If a service remains the same after completing a task, that task was toil.
Tasks that are not toil contribute value to your service in the long run.
Toil is tactical
Toil is reactive and interrupt-driven rather than proactive and strategy-driven. Handling pager alerts is an example of toil.
Toil scales linearly as the service grows
Tasks whose workload increases in proportion to the size of your service are toil. An optimally managed and designed service can grow by at least one order of magnitude with no extra work.
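As a rough illustration of linear scaling, the sketch below estimates hands-on time for a task that must be repeated on every instance of a service. The function name and the six-minutes-per-instance figure are assumptions chosen for the example, not measurements.

```python
def manual_toil_hours(instances: int, minutes_per_instance: float) -> float:
    """Hours of hands-on work when every instance needs the same manual step."""
    return instances * minutes_per_instance / 60


# Growing the service tenfold also grows the manual work tenfold.
for instances in (10, 100):
    hours = manual_toil_hours(instances, minutes_per_instance=6)
    print(f"{instances} instances -> {hours} hours")
```

A well-designed service avoids this pattern: the same growth from 10 to 100 instances should require little or no additional human time.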
Why Is It Important To Reduce Toil?
At least 50% of each engineer’s time should be dedicated to engineering work.
For a team as a whole, toil should not exceed 50% of total working time. This leaves engineers enough time for engineering work that improves the service, which in turn helps reduce toil.
It is the SRE team’s job to reduce toil and scale up services.
Reducing toil provides room for innovation. It also lowers the risk that an SRE role drifts into a pure operations role.
Google’s SREs strive to keep their toil time under 50% to devote at least half of their time to improving their systems’ reliability.
If not adequately managed, toil stacks up to levels that are dangerous for the organization.
How do you calculate toil?
As an engineer, you must measure and understand toil before you can reduce it. The following are some of the ways you can measure toil:
- Separating toil from actual project work: Define what counts as toil and what does not, then log the hours spent on each task.
- Having SREs track their toil time during their on-call time: SREs can keep track of the time they spend on toil and the time they spend on actual project work, with the aim of keeping toil below 50% of their on-call time.
- By taking surveys: Surveys are a great way to estimate toil. You can use monthly or quarterly surveys to estimate how much time the team spends on various tasks. Examine the responses for patterns and prioritize based on the total amount of human time spent. If the toil time exceeds 50%, the SRE team must plan to reduce it to restore a healthy balance between toil and engineering work.
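The measurement approaches above all reduce to the same calculation: hours spent on toil divided by total hours. The following is a minimal sketch of that calculation, assuming each log or survey entry is a tuple of engineer, category, and hours. The entry format, the names, and the sample data are hypothetical.

```python
from collections import defaultdict

TOIL_BUDGET = 0.5  # the 50% ceiling described in the text


def toil_fraction(entries):
    """Return {engineer: share of logged hours spent on toil}.

    entries: iterable of (engineer, category, hours), where category is
    either "toil" or "project" (an assumed labelling scheme).
    """
    totals = defaultdict(lambda: {"toil": 0.0, "project": 0.0})
    for engineer, category, hours in entries:
        if category not in ("toil", "project"):
            raise ValueError(f"unknown category: {category!r}")
        totals[engineer][category] += hours

    fractions = {}
    for engineer, hours_by_category in totals.items():
        total = hours_by_category["toil"] + hours_by_category["project"]
        fractions[engineer] = hours_by_category["toil"] / total if total else 0.0
    return fractions


# Hypothetical one-week log for two engineers.
log = [
    ("alice", "toil", 12.0),
    ("alice", "project", 28.0),
    ("bob", "toil", 24.0),
    ("bob", "project", 16.0),
]

fractions = toil_fraction(log)
over_budget = [name for name, share in fractions.items() if share > TOIL_BUDGET]
print(fractions)    # {'alice': 0.3, 'bob': 0.6}
print(over_budget)  # ['bob']
```

The same function works for survey data if the survey answers are converted to estimated hours per category first.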
Is All Toil Bad?
Toil is not always negative; some toil is unavoidable in the SRE role. New developers joining your team can learn how your system works by working on the toil associated with it.
Toil can also be satisfying, because completing a small task brings an immediate reward.
Some people enjoy doing small tasks, and a limited amount of toil is acceptable for them. Sometimes, it is helpful for people to do such jobs in the short term. But it’s not a good idea for SREs to focus on toil, because it does not advance the engineering part of their job.
Too much toil may lead to the following:
- Low morale and burnout of individuals
- Career stagnation
- Inability to learn new skills
- Reduction in the entire team’s productivity
- Lack of trust in the quality of work
How do you reduce toil in SRE?
The following are some of the methods to reduce the amount of toil for SREs:
Automation: Toil is automatable, so SRE teams should look toward automation. Google’s guidance is that if a machine could accomplish the task just as well as a human, or the need for the task could be designed away, the task is toil. If the task requires human judgment, it is probably not toil.
Standardization: Lack of standardization results in a more complex IT platform, which increases toil. Minimizing the number of IT platforms in use will reduce toil.
Embrace new technologies: Introducing new technology to improve reliability and sustainability can also reduce toil. However, no single new technology will solve every problem.
It is important to encourage teams to try out new technologies in small areas to see how well they perform before using new technology everywhere.