SREs, We Toil Away!
- Chiranga Alwis
- Associate Lead - WSO2
Have you ever wondered why we perform monotonous and mundane tasks on a daily basis? Performing repetitive tasks is generally time-consuming and tedious. You may be wondering what the exact relationship is between repetitive tasks and tech buzzwords like ‘toil’ and ‘overheads’. Let’s dive in and take a look.
At Google, one of the most important measures that Site Reliability Engineering (SRE) teams use to gauge their effectiveness is how they spend their day-to-day working time in pursuit of reliability. Google’s SRE organization aims to keep toil below half of each SRE’s working time.
WSO2 SREs likewise strive to limit the amount of time spent on operational work to below 50%. At least 50% of a WSO2 SRE’s time is allocated to long-term engineering project work instead. Because the term “operational work” may be misinterpreted, SREs refer to it as toil.
Google's SRE book defines toil as “the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
According to Google’s definition, we can consider the following tasks as examples of toil:
- Handling resource quota requests
- Applying simple database changes
- Reviewing non-critical monitoring alerts
- Copying and pasting a set of defined commands from a runbook
Apart from toil, an SRE may engage in administrative tasks that are not directly related to running a production-grade system. These administrative tasks are referred to as overheads.
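To make the split between toil, overheads, and project work concrete, here is a minimal Python sketch that tallies logged hours and checks them against the below-50% target mentioned earlier. The category labels, the `WorkEntry` type, and the sample hours are hypothetical and only serve as an illustration; they are not part of any WSO2 or Google tooling.

```python
from dataclasses import dataclass

# The "below 50%" target described in this article.
TOIL_CAP = 0.5


@dataclass
class WorkEntry:
    category: str  # hypothetical labels: "toil", "overhead", or "project"
    hours: float


def toil_fraction(entries):
    """Return the share of all logged hours that was spent on toil."""
    total = sum(entry.hours for entry in entries)
    if total == 0:
        return 0.0
    toil = sum(entry.hours for entry in entries if entry.category == "toil")
    return toil / total


def within_toil_budget(entries, cap=TOIL_CAP):
    """True only when toil stays strictly below the cap."""
    return toil_fraction(entries) < cap


if __name__ == "__main__":
    # Made-up week of work for a single engineer.
    week = [
        WorkEntry("toil", 15),
        WorkEntry("overhead", 5),
        WorkEntry("project", 20),
    ]
    print(f"toil share: {toil_fraction(week):.1%}")
    print("within budget" if within_toil_budget(week) else "over budget")
```

In this sketch, overheads count toward total hours but not toward toil, so they shrink the time left for project work without affecting the toil check itself. Whether overheads should be tracked against a separate limit is a choice each team would have to make.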
“If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.” - Carla Geisser, Google SRE
Google’s SRE book provides an in-depth introduction to the previously mentioned tasks.
Now, let's focus on how we can reduce the adverse effects of toil and overheads on our day-to-day processes, or eliminate them altogether.
Commonly used automation options
Currently, the WSO2 Public Cloud Services are primarily based on Microsoft Azure. Therefore, the SRE team places significant emphasis on Azure and its associated services, such as Azure DevOps, when automating toil and overheads.
The following is a non-exhaustive list of automation options used by the WSO2 SRE team to address toil and overhead tasks.
- Azure Automation Runbook
Automation Runbooks can be used to automate Azure management tasks and to orchestrate actions across external systems from within Azure. This option is recommended when utilizing Azure resources for operational and maintenance tasks. For further information, please refer to the official documentation.
- GitHub Actions
GitHub Actions are designed to help build robust and dynamic automation tasks. We can use them to automate tasks triggered by any event related to a GitHub project, or run them as scheduled tasks. Compared to Azure Automation Runbooks, a key benefit of GitHub Actions is the ability to execute scheduled task logic implemented in programming languages like Java and Ballerina.
- Azure DevOps Pipelines
Any manual, repetitive task directly associated with the Continuous Integration (CI) and Continuous Deployment (CD) aspects of a deployment can be automated using this service.
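As an illustration of the kind of task logic that any of these options could run on a schedule, the following minimal Python sketch automates one of the toil examples listed earlier: copying and pasting a set of defined commands from a runbook. Python is used here purely for brevity (the article itself names Java and Ballerina), and the `run_runbook` helper, the `StepResult` type, and the placeholder commands are all hypothetical rather than taken from the WSO2 SRE team's tooling.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class StepResult:
    command: list
    returncode: int
    output: str


def run_runbook(steps):
    """Run each command in order and stop at the first failure."""
    results = []
    for command in steps:
        proc = subprocess.run(command, capture_output=True, text=True)
        results.append(StepResult(command, proc.returncode, proc.stdout.strip()))
        if proc.returncode != 0:
            # Later steps usually assume earlier ones succeeded, so stop here.
            break
    return results


if __name__ == "__main__":
    # Placeholder steps; a real runbook would list actual operational commands.
    steps = [
        [sys.executable, "-c", "print('step 1 done')"],
        [sys.executable, "-c", "print('step 2 done')"],
    ]
    for result in run_runbook(steps):
        status = "ok" if result.returncode == 0 else "failed"
        print(f"{status}: {result.output}")
```

Stopping at the first failing step mirrors how an operator would normally follow a runbook by hand. A production version would likely also need logging, timeouts, and alerting, which are left out of this sketch.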
To conclude, we ask the question: is toil always arduous and something that needs to be addressed? We can answer this question with another question: do you ever think about all the repetitive chores you perform daily after you wake up?
In essence, these chores are repetitive behavior too, and in this circumstance, toil is not entirely bad.
We can consider paying someone to do our chores, but that can end up being costly. However, a task like brushing your teeth is not rocket science, and it’s something that we are used to. We know the exact process, so we can get the task done easily. Likewise, there are other cases where we cannot avoid toil, but SREs are comfortable with such tasks because they are low-risk and easy, i.e., the exact procedure that needs to be followed is already known.
The WSO2 SRE team is responsible for implementing processes and best practices that enable Asgardeo, our IDaaS solution; Choreo, our next generation iPaaS for cloud native engineering; and other cloud-based offerings to run in a reliable, scalable, and secure manner. For more information, visit the WSO2 homepage, events page or blog.