All posts by Vanjikumaran Sivajothy

Optimizing “Mean Time To Detect” (MTTD) for WSO2 Incidents

“L1” is a term that both WSO2 customers and employees pay close attention to. It denotes a “Catastrophic Severity Level” incident, known elsewhere in the industry as a High Severity Incident. In most cases, an L1 means that a WSO2 customer has experienced a substantial loss of service that either places a substantial portion of the subscriber’s revenue at risk or severely disrupts business operations.

WSO2 Support portal

WSO2 support and product engineers treat L1 incidents with high priority: they provide an initial response within 1 hour, a workaround for mitigation within 24 hours of the incident being reported, and a resolution within 48 hours. Furthermore, the WSO2 executive team closely monitors these high severity issues, and when the team recognizes a mission-critical issue, it sets up a war room with a SWAT team of product experts, specialists, and technical owners to drive the issue to resolution.

With over 15 years of experience in the integration industry, we know that identifying the root cause of some incidents can take hours, days, weeks, or even months. The cause may lie in the product or in the environment. Middleware tends to receive the blame first, because middleware products sit between services and the consumers of those services.

Most of the time, everyone focuses on optimizing the Mean Time To Recovery (MTTR), a metric that indicates how long a system takes to return to normal production following an incident. The same abbreviation is also used for Mean Time To Resolution, which additionally covers the time taken to address the root cause and to put proactive measures in place so that the incident does not occur again. There is another part of this big picture: Mean Time To Detect (MTTD). This Key Performance Indicator (KPI) measures how long a problem exists in the ecosystem before the appropriate parties become aware of it and start taking action to resolve it, which includes finding the root cause. In this blog, I provide some guidelines on how you can identify high severity issues or incidents in a production system and show how you can organize your efforts to reduce MTTD by applying industry best practices.
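
To make the two metrics concrete, here is a minimal sketch in Python of how they could be computed from incident records. The `Incident` fields and the sample timestamps are made up for illustration, and the sketch assumes that both MTTD and MTTR are measured from the moment the problem starts; your own incident tracker may define the intervals differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the problem began in production
    detected: datetime   # when the responsible team became aware of it
    recovered: datetime  # when normal service was restored


def mean_delta(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    # Mean Time To Detect: average of (detected - started)
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents):
    # Mean Time To Recovery: average of (recovered - started), by assumption
    return mean_delta([i.recovered - i.started for i in incidents])


# Hypothetical incident history
incidents = [
    Incident(datetime(2020, 1, 6, 9, 0), datetime(2020, 1, 6, 9, 30),
             datetime(2020, 1, 6, 11, 0)),
    Incident(datetime(2020, 2, 3, 14, 0), datetime(2020, 2, 3, 15, 30),
             datetime(2020, 2, 3, 18, 0)),
]

print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
```

With this sample data, detection accounts for a third of the total recovery time, which is why shortening detection is worth treating as a goal of its own.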

As WSO2 helps organizations move from a monolithic architecture to a distributed microservices architecture along with cloud-native adoption, the resulting environments include thousands of components interacting in complex, rapidly changing deployments over multiple tiers. Every node therefore produces a large number of events, metrics, and other data. Today’s dynamic cloud-native environments also combine many different technologies and tools. Currently, the industry uses two techniques for monitoring such environments: observability and surveillance.

Without a proper system to monitor or observe the ecosystem, organizations will never be able to detect or resolve damaging problems quickly. Hence, I introduce the monitoring tools, KPIs, alerting mechanisms, and observability techniques that can significantly reduce MTTD.

Observability

Observability is the critical pillar for reducing MTTD. By definition, observability refers to the collection of diagnostic data across all stacks to identify and debug production problems, as well as to provide critical signals about usage that enable a highly adaptive and scalable system. Observability is primarily driven by five dimensions:

  • Monitoring
  • Tracing
  • Log aggregation
  • Visualization
  • Alerting

Let us now take a closer look at each of these dimensions.

Monitoring

A fundamental aspect of monitoring is to collect, process, aggregate, and display real-time quantitative data about an ecosystem, measuring metrics at three levels: network, machine, and application. Such monitoring produces figures like error counts and types, processing times, memory usage, and server lifetimes.
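
As an illustration of the application level only, the sketch below collects error counts by type and processing times inside a single Python process. The `Metrics` class and the `handle` function are hypothetical names invented for this example; a real deployment would normally rely on an APM agent or a metrics library rather than hand-rolled code.

```python
import time
from collections import Counter


class Metrics:
    """Tiny in-process collector for error counts and processing times."""

    def __init__(self):
        self.error_counts = Counter()
        self.durations = []
        self.started_at = time.monotonic()

    def observe(self, func, *args):
        """Run func, recording how long it took and any exception type."""
        begin = time.perf_counter()
        try:
            return func(*args)
        except Exception as exc:
            self.error_counts[type(exc).__name__] += 1
            raise
        finally:
            self.durations.append(time.perf_counter() - begin)

    def snapshot(self):
        calls = len(self.durations)
        return {
            "calls": calls,
            "errors": dict(self.error_counts),
            "avg_seconds": sum(self.durations) / calls if calls else 0.0,
            "uptime_seconds": time.monotonic() - self.started_at,
        }


def handle(payload):
    # Hypothetical request handler that rejects empty payloads
    if not payload:
        raise ValueError("empty payload")
    return len(payload)


metrics = Metrics()
for payload in ["order-1", "", "order-2"]:
    try:
        metrics.observe(handle, payload)
    except ValueError:
        pass  # the failure has already been counted

snapshot = metrics.snapshot()
print(snapshot)
```

Recording the duration in the `finally` clause means failed calls are timed as well, so slow failures are not hidden from the average.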

JConsole view for CPU and memory usage

For example, the performance of a JVM-based application can be monitored simply by using JConsole, which lets you collect metrics such as CPU usage, memory usage, and the number of running threads.

Thread monitoring

However, for larger enterprises with distributed applications, monitoring a single JVM or machine is not enough. Instead, Application Performance Monitoring (APM) tools should be in place to monitor multiple functional dimensions. For example, Dynatrace, AppDynamics, New Relic, Datadog, and Apache SkyWalking are full-fledged monitoring and analytics platforms that provide APM.

WSO2 API Manager profiles and WSO2 Enterprise Integrator interaction view in AppDynamics

Tracing

Traditionally, monolithic applications employed logging frameworks to provide insight into what happened when something failed in the system. Log statements with correct timestamps and context are usually enough to recognize and understand a failure, and most of the information is revealed if the logs were defined well during development. However, in a distributed microservices architecture, logs alone are not enough to see the big picture.

Tracing can be easily understood through the analogy of a medical angiogram. An angiogram is a technique used to find blockages in the blood vessels of the heart: an X-ray-sensitive dye is injected, and a series of X-ray snapshots taken while the dye moves through the blood vessels reveals where the flow is blocked. Detecting the bottlenecks in this manner makes it possible to fix exactly the affected spot, rather than searching everywhere or replacing the entire heart.

Medical angiogram

Likewise, tracing is heavily used in distributed software ecosystems to profile and monitor the communication or transactions between multiple systems, including networks and applications. Furthermore, tracing helps to understand the flow between services by giving application-level transparency. Zipkin, Jaeger, Instana, Datadog, Apache SkyWalking, and Appdash are a few examples of distributed tracing tools that support the OpenTracing specification.
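
The idea behind these tools can be sketched in a few lines of plain Python. The example below does not use the OpenTracing API or any of the tools named above; the `span` helper, the service names, and the in-memory `finished_spans` list are all assumptions made for illustration. It shows the two essentials: every hop carries the same trace ID (the "dye"), and each span records its parent and duration so that the bottleneck can be located afterwards.

```python
import time
import uuid
from contextlib import contextmanager

finished_spans = []  # stand-in for a tracing backend


@contextmanager
def span(name, trace_id=None, parent_id=None):
    """Record one timed unit of work belonging to a trace."""
    record = {
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "name": name,
    }
    begin = time.perf_counter()
    try:
        yield record
    finally:
        record["duration"] = time.perf_counter() - begin
        finished_spans.append(record)


def call_backend(trace_id, parent_id):
    # The trace context is passed along explicitly, as a header would be
    with span("backend-service", trace_id, parent_id):
        time.sleep(0.02)  # simulated slow dependency


def handle_request():
    with span("api-gateway") as root:
        with span("mediation", root["trace_id"], root["span_id"]) as child:
            call_backend(child["trace_id"], child["span_id"])
    return root["trace_id"]


def self_time(s, spans):
    """Time spent in a span itself, excluding its direct children."""
    children = sum(c["duration"] for c in spans
                   if c["parent_id"] == s["span_id"])
    return s["duration"] - children


trace_id = handle_request()
trace = [s for s in finished_spans if s["trace_id"] == trace_id]
slowest = max(trace, key=lambda s: self_time(s, trace))
print("bottleneck:", slowest["name"])
```

Ranking spans by self time rather than total duration matters here: the outermost span is always the longest overall, yet it is the innermost call that actually consumes the time.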

Log aggregation

There are many varieties of logs, such as application logs, security logs, audit logs, access logs, and more. In a single application, the complexity of all these logs is manageable. However, in a distributed architecture, many applications or services work together to complete a single business function. For example, ordering a pizza involves checking the store availability, making the payment, placing the order, fulfilling the order, enabling tracking, scheduling the shipment, and many other activities.

In the event of an error in such a complex transaction, tracing may pinpoint where to search for the root cause. However, if application-centric logs are scattered across different components, finding the exact issue becomes a nightmare, and the time taken to find the relevant logs could make the situation more critical. Therefore, a centralized location that collects and indexes all the logs belonging to an enterprise is critical for locating an issue efficiently.

Currently, there are multiple tools on the market for log aggregation. Splunk, Sumo Logic, Elastic, and Graylog play important roles in this market.

As previously stated, one of the main responsibilities of log aggregators is collecting and storing logs from multiple sources. As shown in the image below, logs are collected from the different containers in a given Kubernetes pod.

Logs view in Kibana

Searching through the logs from multiple sources in one centralized place enables an enterprise to locate the necessary information quickly, which reduces MTTD.

String search in Kibana

Indexing in log aggregators allows organizations to systematically organize logs into categories. Most of the time, logs are indexed according to the source they originate from, such as the application name, hostname, datacenter, or IP address.

Indexing in Splunk for WSO2 API
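
The three responsibilities described above (collecting, indexing by source, and searching) can be sketched in miniature as follows. This is not how Splunk or Elastic work internally; the records, field names, and functions are hypothetical and only illustrate why an index on source fields makes a centralized search fast to narrow down.

```python
from collections import defaultdict

# Hypothetical log records already collected from several hosts
records = [
    {"host": "gw-1", "app": "gateway",
     "message": "ERROR backend timeout for /orders"},
    {"host": "gw-2", "app": "gateway",
     "message": "INFO request served /menu"},
    {"host": "ei-1", "app": "integrator",
     "message": "ERROR payment service unreachable"},
    {"host": "ei-1", "app": "integrator",
     "message": "INFO order placed"},
]


def build_index(records, fields=("host", "app")):
    """Map each (field, value) pair to the positions of matching records."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        for field in fields:
            index[(field, record[field])].append(position)
    return index


def search(records, index, text, **filters):
    """String search, optionally narrowed by indexed source fields."""
    candidates = None
    for field, value in filters.items():
        positions = set(index.get((field, value), []))
        candidates = positions if candidates is None else candidates & positions
    if candidates is None:
        candidates = range(len(records))  # no filter: scan everything
    return [records[i] for i in sorted(candidates)
            if text in records[i]["message"]]


index = build_index(records)
all_errors = search(records, index, "ERROR")
gateway_errors = search(records, index, "ERROR", app="gateway")
print(len(all_errors), "errors in total,", len(gateway_errors), "on the gateway")
```

The string match runs only over the records that survive the index lookup, which is the same reason an indexed aggregator answers a filtered query faster than a search across raw files on every host.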

Visualization

There are tools that collect data, logs, or metrics in a centralized location. However, if the collected data and logs do not yield any meaningful information, they are not useful. Most APM tools and log aggregators provide data visualization that gives a holistic view based on the criteria provided.

For example, the host with the highest number of error messages can be identified easily with visualization.

Abusive usage of the services

Another good example is correlating two different errors that took place on separate hosts or applications; such correlations can be made visible with time series aggregation charts.

Visualization of two errors that can be correlated
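
Behind such charts sits a simple aggregation. The sketch below, which uses made-up events and error names, counts errors per host, buckets two error types into per-minute time series, and computes a Pearson correlation between the two series as a rough numeric stand-in for what the eye sees when two lines rise and fall together.

```python
from collections import Counter
from math import sqrt

# Hypothetical (minute, host, error) events taken from aggregated logs
events = (
    [(m, "ei-1", "DBConnectionError") for m in (1, 1, 2, 2, 2, 3)]
    + [(m, "gw-1", "BackendTimeout") for m in (1, 2, 2, 2, 3)]
)


def series(events, error, buckets):
    """Number of occurrences of one error type per one-minute bucket."""
    counts = Counter(minute for minute, _, name in events if name == error)
    return [counts[m] for m in range(buckets)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if one is flat)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Host with the highest number of errors
noisiest = Counter(host for _, host, _ in events).most_common(1)[0]

db_series = series(events, "DBConnectionError", buckets=6)
timeout_series = series(events, "BackendTimeout", buckets=6)
r = pearson(db_series, timeout_series)
print(noisiest, db_series, timeout_series, round(r, 2))
```

A high correlation only suggests that the two errors are related; the chart, or the number, tells you where to look next and is not proof of a cause.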

Data visualization is not restricted to errors and exceptions; it can also be used to understand the behavior of application users. For example, if a user over-uses an API, data visualization can help to detect the abusive behavior.

Alerting

Searching through logs and data can help to speed up debugging and resolving issues. But in reality, manually watching visualizations to detect incidents is not practical. Hence, it is critical to create automated alerts.

Common scenarios that require alerts include the sudden failure of one or more APIs, a sudden increase in the response time of one or more APIs, and a change in the pattern of API resource access. These alerts can be delivered as an email, a phone call, an instant message, or a PagerDuty notification. It’s important to note that when a predefined condition is met or violated, the necessary stakeholders need to be informed with the right amount of information rather than too much data.

Alert email from AppDynamics on Gateway Behavior
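
A rule-based alert of this kind could look roughly like the following sketch. The thresholds (a 5% failure rate and a doubling of the baseline response time), the metric names, and the `notify` callback are assumptions chosen for illustration; in practice the APM tool or log aggregator evaluates the rules and hands the message to the email, chat, or paging channel.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # True when the alert should fire
    summary: Callable[[dict], str]     # short, focused message


rules = [
    Rule(
        "api-failure-rate",
        lambda m: m["requests"] > 0 and m["errors"] / m["requests"] > 0.05,
        lambda m: f"{m['errors'] / m['requests']:.1%} of "
                  f"{m['requests']} requests failed",
    ),
    Rule(
        "api-response-time",
        lambda m: m["p95_ms"] > 2 * m["baseline_p95_ms"],
        lambda m: f"p95 is {m['p95_ms']} ms "
                  f"(baseline {m['baseline_p95_ms']} ms)",
    ),
]


def evaluate(metrics, rules, notify):
    """Send one concise notification per violated rule."""
    fired = []
    for rule in rules:
        if rule.condition(metrics):
            notify(f"[{rule.name}] {metrics['api']}: {rule.summary(metrics)}")
            fired.append(rule.name)
    return fired


# Hypothetical metrics for the last five minutes of one API
window = {"api": "/orders", "requests": 200, "errors": 30,
          "p95_ms": 450, "baseline_p95_ms": 400}

sent = []  # stand-in for an email, chat, or paging integration
fired = evaluate(window, rules, sent.append)
print(sent)
```

Each rule builds its own one-line summary, so the recipient gets only the figures that triggered the alert instead of the whole metrics record.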

Surveillance

Collecting data at random and building different views of the same random data does not really reveal anything.

In the real world, surveillance is used by the police or security organizations to monitor activities, and the recordings may later be used as evidence of crimes. Likewise, in software, surveillance means targeted observation of the system to ensure that its functionality and performance do not deviate from the intended behavior.

Let’s take the example of applications that handle real-time traffic or process large payloads and therefore tend to be memory intensive. Such applications are likely to consume a lot of memory, and if an application is not properly designed and developed to handle this, it may run out of memory and crash. Detecting memory leaks or abnormal memory usage is critical to uninterrupted service.

Memory leak detection in AppDynamics
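
One simple way to express such targeted observation in code is to watch the trend of memory usage rather than its absolute value. The sketch below fits a least-squares line through evenly spaced heap samples and flags a steady upward slope. It is not how AppDynamics detects leaks; the sample values, the 1 MB-per-sample threshold, and the assumption that each sample is the heap size measured just after a garbage collection are all illustrative.

```python
def slope(samples):
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def looks_like_leak(heap_after_gc_mb, min_samples=6, threshold_mb=1.0):
    """Flag a sustained upward trend in post-GC heap usage."""
    if len(heap_after_gc_mb) < min_samples:
        return False  # too little data to call it a trend
    return slope(heap_after_gc_mb) > threshold_mb


# Hypothetical heap usage in MB, sampled after each garbage collection
healthy = [510, 498, 505, 512, 500, 507, 503, 509]
leaking = [500, 512, 521, 535, 548, 556, 570, 581]

print("healthy:", looks_like_leak(healthy))
print("leaking:", looks_like_leak(leaking))
```

Using post-collection samples filters out the normal sawtooth of allocation and collection, so what remains is memory the application is genuinely holding on to.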

Conclusion

Optimizing infrastructure to minimize the mean time taken to detect WSO2 incidents ensures that an organization has established systematic techniques to employ observability and surveillance technologies effectively, identify incidents right away, and keep the system stable.
