SRE at WSO2
- Muhammed Shariq
- Software Engineer - WSO2
WSO2’s site reliability engineering (SRE) practice began when the organization shifted to being a cloud-first company with software-as-a-service (SaaS) offerings such as Choreo and Asgardeo. In any cloud-based SaaS solution, uptime and reliability are of the utmost importance. The role of SRE is to ensure the availability of public-facing services by implementing highly scalable and reliable software systems, with proper monitoring and alerting systems.
This blog provides an overview of how the WSO2 SRE team was established, and their main roles and responsibilities within the company.
Having the right skill set
Typically, organizations that bootstrap their own SRE practice find it challenging to hire people with the right skill set. In theory, site reliability engineers are software engineers who produce software solutions for system operations. In practice, however, these engineers need a wide range of skills that take time to develop, which can be costly to an organization. Core competencies for an effective site reliability engineer include a deep understanding of cloud-computing concepts, exposure to infrastructure-as-a-service (IaaS) platforms (e.g., Azure, AWS, and GCP), and knowledge of infrastructure automation, monitoring tools, and security.
The making of the WSO2 SRE team
The installation experience team at WSO2 was a specialized engineering team, responsible for researching and developing solutions to deploy WSO2 products (such as WSO2 API Manager, WSO2 Enterprise Integrator, and WSO2 Identity Server) using various infrastructures. These infrastructures included Docker, Kubernetes, AWS, GCP, Azure, VMware Tanzu, etc., which required expertise in technologies such as AWS CloudFormation, Terraform, Helm, Puppet, and Ansible.
The WSO2 SRE team was built on the installation experience team, which already had expertise in IaaS and cloud infrastructure, infrastructure-as-code (IaC), configuration management, and continuous integration/continuous deployment (CI/CD) pipelines. So, bootstrapping the WSO2 SRE practice was not as difficult as it would have been for an organization adopting cloud technologies for the first time.
All the new cloud-based solutions at WSO2 were deployed on Microsoft Azure. So, the WSO2 SRE team invested time learning the various infrastructure services offered by Azure in order to design and architect highly scalable, reliable systems. Today, the WSO2 SRE team is the go-to team for expertise on the Azure platform, and has a key role in ensuring the availability and reliability of Choreo and Asgardeo.
WSO2 SRE team responsibilities
1. Provisioning infrastructure
The WSO2 SRE team works closely with product teams to identify the optimal deployment architecture for a given application. The product team first outlines the requirements, and the SRE team then designs the deployment architecture by considering factors such as scalability, availability, security, redundancy, and cost. Once the deployment architecture is finalized and reviewed by Azure architects (if required), the IaC scripts are produced using Terraform and Azure Resource Manager (ARM) templates. The SRE team also works closely with the platform engineering team on application deployment aspects such as Kubernetes/Kustomize manifests and GitOps processes. These scripts and manifests are then used by the DevOps team to provision the infrastructure and deploy applications.
Aspects such as infrastructure monitors, alerting, and alert groups are also baked into the IaC scripts themselves. So whenever a new environment is provisioned, it is already configured with monitoring and alerting. Maintaining IaC scripts provides a reliable method to recreate identical environments in a repeatable manner, with minimal human error and guesswork.
2. Azure platform expertise
Because the SRE team works with Azure services on a daily basis, it has built deep expertise in the Azure services that are used at WSO2. The SRE team works closely with the product teams and helps them build and deploy applications in a scalable and reliable manner.
The SRE team liaises with Azure architects to identify the optimal solutions to deploy and run applications and services on Azure.
3. Security and compliance
The WSO2 SRE team includes a specialized SecOps sub-team, responsible for the security and compliance of all WSO2 cloud offerings. The SecOps team enforces stringent security measures and controls during the development, deployment, and operation stages. It ensures that security is part of the architecture and design, and that all changes rolled out to higher environments comply with security best practices and standards.
Additionally, the SecOps team takes the lead in defining and documenting all processes and procedures related to running the cloud solution. Some of these processes include incident management, change management, and case management.
Another responsibility of the SecOps team is to obtain certifications against various cloud compliance standards. This team liaises with the audit firm on compliance requirements and standards, and then works with the SRE and product teams to achieve the required compliance levels.
4. Maintaining the uptime using SLIs, SLOs, and SLAs
The WSO2 SRE team is responsible for maintaining the uptime of the cloud service offerings by defining service level indicators (SLIs) that measure compliance with service level objectives (SLOs). The SLIs that WSO2 has defined include the percentage of requests that succeed and the latency of these requests.
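To make the two SLIs concrete, here is a minimal Python sketch of how they could be computed from a batch of request records. It is an illustration only, not WSO2's actual implementation: the `Request` record, the choice to count any non-5xx response as successful, the 300 ms latency threshold, and the 99.0 target are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions."""
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to serve the request, in milliseconds


def availability_sli(requests: Sequence[Request]) -> float:
    """Percentage of requests that did not fail with a server error (5xx)."""
    if not requests:
        return 100.0  # assumption: no traffic means no failed requests
    successful = sum(1 for r in requests if r.status < 500)
    return 100.0 * successful / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> float:
    """Percentage of requests served within threshold_ms."""
    if not requests:
        return 100.0
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return 100.0 * fast / len(requests)


def meets_slo(sli_percent: float, slo_target_percent: float) -> bool:
    """An SLO is met when the measured SLI is at or above its target."""
    return sli_percent >= slo_target_percent


if __name__ == "__main__":
    sample = [
        Request(200, 120.0),
        Request(200, 80.0),
        Request(503, 900.0),  # server error, also slow
        Request(404, 60.0),   # client error, not counted against the service
    ]
    print(availability_sli(sample))
    print(latency_sli(sample, threshold_ms=300.0))
    print(meets_slo(availability_sli(sample), 99.0))
```

In a real system these numbers would normally come from a monitoring backend rather than an in-memory list, but the ratio being computed is the same.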
For each cloud offering, we maintain an internal SLO that is stricter than the published service level agreement (SLA), to guarantee that we meet the specified SLA.
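The effect of keeping the internal SLO above the SLA can be shown with a little arithmetic. The sketch below converts an availability target into the downtime it allows over a window. The 99.9% SLA and 99.95% SLO figures are hypothetical values chosen for illustration, not WSO2's published numbers.

```python
def allowed_downtime_minutes(target_percent: float, window_days: int = 30) -> float:
    """Downtime permitted by an availability target over a window of days."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - target_percent) / 100.0


# Hypothetical targets, used only to illustrate the margin.
SLA_TARGET = 99.9    # what is promised to customers
SLO_TARGET = 99.95   # stricter internal objective

if __name__ == "__main__":
    sla_budget = allowed_downtime_minutes(SLA_TARGET)
    slo_budget = allowed_downtime_minutes(SLO_TARGET)
    # The difference is the safety margin before the SLA itself is at risk.
    print(f"SLA allows ~{sla_budget:.1f} min of downtime per 30 days")
    print(f"SLO allows ~{slo_budget:.1f} min of downtime per 30 days")
    print(f"Margin: ~{sla_budget - slo_budget:.1f} min")
```

With these assumed targets, the SLA tolerates roughly 43 minutes of downtime in 30 days while the internal SLO tolerates roughly half of that, so a breach of the SLO acts as an early warning well before the SLA is violated.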
5. Managing toil and overheads
In SRE, toil refers to repetitive, mundane tasks that have to be performed manually. While these tasks might seem insignificant, they require effort and take attention away from more critical work. To limit these distractions, the WSO2 SRE team keeps track of a list of toil tasks and works on automating them. Azure Automation runbooks, pipelines, and scripts are used to automate toil tasks.
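One simple way to decide which tracked tasks to automate first is to rank them by the time they consume each month. The Python sketch below shows that idea. The post does not describe how the team prioritizes its list, so the ranking rule, the `ToilTask` fields, and the example task names are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToilTask:
    """Hypothetical entry in a toil-tracking list."""
    name: str
    minutes_per_run: float
    runs_per_month: int

    @property
    def minutes_per_month(self) -> float:
        # Total manual effort this task costs in a month.
        return self.minutes_per_run * self.runs_per_month


def automation_candidates(tasks: List[ToilTask]) -> List[ToilTask]:
    """Return tasks ordered from most to least monthly effort."""
    return sorted(tasks, key=lambda t: t.minutes_per_month, reverse=True)


if __name__ == "__main__":
    # Example tasks are invented for illustration.
    backlog = [
        ToilTask("rotate service credentials", minutes_per_run=30, runs_per_month=2),
        ToilTask("clean up stale test environments", minutes_per_run=15, runs_per_month=20),
        ToilTask("restart stuck batch job", minutes_per_run=5, runs_per_month=8),
    ]
    for task in automation_candidates(backlog):
        print(f"{task.name}: {task.minutes_per_month:.0f} min/month")
```

A fuller model would also weigh the cost of building and maintaining the automation, which this sketch leaves out.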
6. Handling critical production incidents
The WSO2 SRE team ensures that the SLOs are met for WSO2's cloud service offerings. The SRE team plays a critical role in resolving any issues or incidents that occur in production environments. The team works on an on-call roster and is notified of any critical issues. Its responsibility is to restore the environment to a working state in as little time as possible, in line with the SLOs, thereby minimizing downtime and the impact on end users.
This blog post provides a high-level overview of how the WSO2 SRE team was formed, and their main roles and responsibilities. In our upcoming blogs, we will be covering the topics of SecOps, toil and overheads, and production incident handling processes in more detail.