- Most observability data sits on disk unprocessed, delivering few insights and little value.
- Despite all the observability tools, debugging remains incredibly difficult.
- The three pillars of observability (metrics, logs, and traces) are necessary but not sufficient.
- Idea 1: Asking users to build their visualizations is not enough; we have to think like profilers and create nuanced, clever views like flame graphs out of the box.
- Idea 2: Creating a feedback loop for troubleshooters through selective deep tracing that is closely integrated with code.
As more enterprises adopt microservice architectures, they soon find that the services forming their applications are scattered across many machines. In such a highly distributed environment, gaining visibility into those systems and how they are performing resembles hunting for the proverbial needle in a haystack. Fortunately we can turn to observability to make our systems visible.
Why do we need visibility into systems? There are three main answers: system reliability, compliance, and insights into growth drivers. Although in theory all three are possible, as of now observability is primarily used for site reliability, since both compliance and growth driver tools have their own means for collecting data and rarely tap into observability data. Once there is more extensive standardization on data collection specifications, it is likely that all three types of tools will use observability data.
For this article, we will focus on site reliability. Observability helps site reliability teams by providing the data and insights they need to quickly troubleshoot and fix issues or even proactively stop issues from happening in the first place.
The current state of the art in observability is not enough
To date, we have achieved reasonable success with data collection for observability, but too often the data sits in raw form (e.g. spans) on disk. Unfortunately, this data sees very little use, and very few insights are derived from it. So, only a fraction of the promised value is being delivered.
Despite all the observability tools available for site reliability, debugging remains incredibly difficult, and many site reliability engineers (SREs) would agree that their debugging processes have only marginally improved.
Observability is still relatively nascent, and, as Bogomil Balkansky points out, it is likely to go through many more waves of redesign, with each wave creating more value.
Broadly, there are three elements of observability: collecting the right data, presenting the right data to the user, and then closing the loop by letting the engineer home in on the problem. These are supported by the three pillars of observability: metrics, logs, and traces. However, while these pillars are necessary, they are not sufficient.
This article is an effort to describe how observability for troubleshooting could and should be done from the user’s point of view.
Collecting the right data
When we talk about observability, there are two sets of tools: specific observability tools, such as Zipkin and Jaeger, and broader application performance monitoring (APM) tools, such as Datadog and AppDynamics.
When monitoring systems, we need information from all levels, from method and operating system level tracing to database, server, API call, thread, and lock data tracing. Asking developers to add instrumentation to get these statistics is costly and time consuming and should be avoided whenever possible. Instead, developers should be able to use plugins, interception, and code injection to collect as much data as possible.
APM tools have done a pretty good job of this. Typically they have instrumentation (e.g. Java agents) built for specific programming languages to collect method-level tracing data, and they have added custom filter logic to detect database, server, API call, thread, and lock data tracing by looking at the method traces. Furthermore, they have added plugins to commonly used middleware tools to collect and send instrumented data. One downside of this approach is that the instrumentation needs will change as programming languages and middleware evolve.
With microservices, most processing spans multiple services and thus multiple machines. Therefore, the data we discussed so far are collected across many machines, and to understand the processing from end to end, we must propagate context through all invocations, even asynchronous calls.
For example, when a system receives a request, an identifier should be created and passed through all invocations triggered by that request. Different parts of the system should include that identifier in all metrics, traces, and logs, which enables us to later connect all processing related to the request. There are other ways to achieve the same result. For example, OpenTelemetry shares an identifier between each parent process and client process, but with this approach, tracing a request end to end incurs much more overhead.
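As a minimal sketch of the per-request identifier described above, the following Python example keeps the identifier in a context variable, stamps it on every log line, and forwards it on outgoing calls. The `X-Request-Id` header name, the `orders` logger, and the `handle_request` function are assumptions made for illustration and are not taken from any particular tool.

```python
import contextvars
import logging
import uuid

# Holds the identifier of the request currently being processed.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request identifier onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def make_logger(name):
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(request_id)s %(name)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = make_logger("orders")  # hypothetical service name


def outgoing_headers():
    """Headers to attach to any downstream call so the identifier travels on."""
    # "X-Request-Id" is an assumed header name, used here only as an example.
    return {"X-Request-Id": request_id_var.get()}


def handle_request(incoming_headers):
    # Reuse the caller's identifier if present; otherwise start a new one.
    request_id = incoming_headers.get("X-Request-Id") or uuid.uuid4().hex
    token = request_id_var.set(request_id)
    try:
        log.info("processing order")
        return outgoing_headers()
    finally:
        # Restore the previous value so identifiers never leak across requests.
        request_id_var.reset(token)
```

Because `contextvars` values follow asyncio tasks, the same identifier should also be visible inside asynchronous calls started while the request is being handled.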
Collecting data adds overhead, and if the system handles large loads, it is often not possible to measure everything. Typically, the volume of logs is already well tuned to large workloads and therefore manageable. Since we collect metrics periodically, larger loads on the system generally do not significantly increase the metrics collection overhead. By contrast, tracing overhead grows directly with the number of requests, so under large loads it is often not possible to trace every request. Even Google's Dapper system traces only about 1 in every 1,000 requests.
So we must sample by only tracing a few requests, such as 1 in every 100. With sampling, we can reduce overhead, but there is a chance that when an error happens, we do not have the relevant observability data among the collected data. This leads to challenges that we need to be prepared to handle.
First, if each service picks samples independently, we will only have some traces for a given request. This can lead to situations where we have only small parts of every user request, making the observability useless. To avoid this when sampling, we should pick some user requests and trace them end to end in full detail.
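One way to get consistent end-to-end sampling, sketched below, is to derive the sampling decision from the request identifier itself, so that every service reaches the same decision without coordinating. This is only one possible technique (another is to decide once at the entry point and propagate a flag). The `should_trace` function and its parameters are hypothetical names.

```python
import hashlib


def should_trace(request_id: str, rate: float) -> bool:
    """Return the same decision for the same request_id on every service.

    rate is the fraction of requests to trace, e.g. 0.01 for 1 in 100.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Since the decision depends only on the identifier and the rate, a request selected by the first service is selected by every other service that sees the same identifier and uses the same rate.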
A second challenge is that a sampling rate like 1 in 100 works well for a service that handles 5,000 requests per second, but not for a service that receives only 50 requests per day. A more adaptive sampling policy, such as tracing at most 100 requests per second, will work better.
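A rate-limited policy like the one above could look roughly like the following sketch, which uses a simple fixed one-second window. The `RateLimitedSampler` class is a hypothetical name, and the injectable `clock` argument exists only to make the behavior easy to test.

```python
import time


class RateLimitedSampler:
    """Trace every request until a per-second budget is used up."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = max_per_second
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def should_trace(self):
        now = self.clock()
        # Start a fresh budget once the current one-second window has passed.
        if now - self.window_start >= 1.0:
            self.window_start = now
            self.count = 0
        if self.count < self.max_per_second:
            self.count += 1
            return True
        return False
```

With a budget of 100, a service receiving 50 requests per day has every request traced, while a service receiving 5,000 requests per second traces 100 of them each second. Note that this simple version always picks the earliest requests in each window; a production sampler would likely spread the budget more evenly.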
More work needs to be done to intelligently select which requests to sample. For example, one proposed approach looks at recent data and predicts what requests have the highest likelihood of an error. It then traces those requests.
Presenting the right data to the troubleshooter
Engineers can’t keep looking at the logs or charts and waiting for a problem to occur. We need to send them alerts when potential problems emerge. Alerts can come in many forms, such as emails, SMS, or PagerDuty alerts.
When the alert arrives, engineers who use observability data for troubleshooting should not walk through all the data. It simply is not an efficient use of their time. Instead, we first need a way to detect performance anomalies and direct the attention of developers or engineers to the corresponding data. Once a user has received the alert and checks in, we should present the data in the form that is easiest to understand and point to potential anomalies.
Observability systems can detect anomalies either through user-defined rules or with a combination of statistical analysis and machine learning. Although solutions exist for both approaches, as shown in the failure sketching example, the latter has room for significant improvement.
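To make the two approaches concrete, the sketch below contrasts a user-defined rule with a very simple statistical check based on a rolling z-score. The threshold values, window size, and function names are assumptions chosen for illustration; real systems typically use more robust models.

```python
from statistics import mean, stdev


def rule_alert(latency_ms, threshold_ms=500.0):
    """User-defined rule: alert when latency exceeds a fixed threshold."""
    return latency_ms > threshold_ms


def zscore_anomalies(values, window=20, z_threshold=3.0):
    """Return indexes whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # a flat history gives no scale to compare against
        if abs(values[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies
```

The statistical check can flag a jump to 400 ms on a service that normally responds in about 100 ms, even though the fixed 500 ms rule stays silent.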
Most current tools solve data presentation problems by letting users create their own views and charts. Engineers can then decide on useful key performance indicators (KPIs) to be rendered and create their own corresponding charts, which they can use for debugging and troubleshooting.
Such charts are useful for detecting a system’s current status and noticing common recurring problems. However, they do not work well with troubleshooting processes where the engineer creates hypotheses for the cause of the problem and then iteratively drills down, verifies, and, if needed, refines or changes the hypothesis.
Instead, we can learn from profilers, such as Java Flight Recorder. Profilers neither provide a dashboard tool nor ask users to create their own charts to understand the problem. Instead, they provide a few well-designed views, such as call trees, hotspot views, or memory views, that help engineers troubleshoot. Observability tools can do the same.
The rest of this section describes several observability views that work. They are highly influenced by recommendations made by Cindy Sridharan as well as my own experiences in performance troubleshooting and data science.
When an engineer knows that there is a potential problem and uses the observability system to investigate it, they need to quickly jump to the time when the potential problem may have occurred and see metrics, logs, and traces in context.
For example, the following view shows logs and metrics arranged along a timeline in a meaningful manner. Users can see what happened and how metrics behaved in a single view. Often, the remarkable pattern detection capabilities of the human eye help to connect current situations with past experiences.
We can further help the engineer by annotating potential anomalies identified through artificial intelligence (AI) directly into the view.
We can understand the value of the view above by contrasting it with alternative views. Many observability tools use an alternative view where the user can draw multiple plots and move a cursor in all of the plots simultaneously based on the time under analysis. While easier to implement, that view does not help with visual pattern recognition as much as the aforementioned view because multiple time series are scattered and do not fit into a single view as naturally. On the other hand, the proposed view shows all relevant data on the same screen, in context, making it easy for the human eye to compare and contrast.
If there are clear change points associated with the problem, the timeline view will significantly help the engineer. However, the problem may be more scattered and hard to pin to a particular point in time. In such cases, to find the problem, we look at aggregated data instead of a specific trace.
For example, the following flame graph shows how latency is spread between different functions aggregated across many requests.
(image from https://commons.wikimedia.org/wiki/File:MediaWiki_flame_graph_screenshot_2014-12-15_22.png)
The view presents bottlenecks as valleys, and if needed, users can compare this view over time and create a differential chart, showing how current traces differ from normal behaviors.
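As a rough sketch of the aggregation behind such a view, the code below sums the time spent per call path across many requests and then computes the difference against a baseline. The semicolon-joined "folded stack" keys follow the convention used by common flame graph tools; the input shape (a call path plus the time spent in it) and the function names are assumptions.

```python
from collections import Counter


def fold_stacks(samples):
    """Sum time per call path.

    samples: iterable of (call_path, duration_ms), where call_path is a
    tuple of function names from the root down to the measured function.
    """
    folded = Counter()
    for call_path, duration_ms in samples:
        folded[";".join(call_path)] += duration_ms
    return folded


def diff_folded(baseline, current):
    """Per call path, how much more (or less) time is spent now."""
    paths = set(baseline) | set(current)
    return {p: current.get(p, 0) - baseline.get(p, 0) for p in paths}
```

A renderer would draw each folded path as a stack of boxes whose width is proportional to the summed time; the differential values can then be used to color the boxes that grew or shrank.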
In some other cases, problems originate at one point and manifest across the system. Then, to find the culprit, we need a topology view showing the dependencies between the components contributing to a request.
For example, if the database has slowed down, we will see all downstream services showing high latencies and even connection stacking. Only by understanding how errors propagate through the topology can we find the root cause.
To make the analysis simple and focused, it is useful to have a dynamically generated topology that shows only the relevant services and servers or APIs.
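The following sketch shows one way a request-specific topology could be derived from trace spans and used to narrow down the culprit: a slow service is a root-cause candidate only if none of the services it calls are slow. The span fields (`span_id`, `parent_id`, `service`) and the function names are assumptions, not the schema of any particular tracing tool.

```python
from collections import defaultdict


def build_topology(spans):
    """Map each service to the set of services it calls, based on span parents."""
    service_of = {s["span_id"]: s["service"] for s in spans}
    calls = defaultdict(set)
    for s in spans:
        calls.setdefault(s["service"], set())
        caller = service_of.get(s["parent_id"])
        if caller is not None and caller != s["service"]:
            calls[caller].add(s["service"])
    return dict(calls)


def root_cause_candidates(calls, slow_services):
    """Slow services whose own dependencies are all healthy."""
    return sorted(
        svc for svc in slow_services
        if not (calls.get(svc, set()) & slow_services)
    )
```

Because the topology is built only from the spans of the requests under investigation, it contains just the services that actually took part in them.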
When we find the potential causes, we switch to traces related to problematic executions and step through them to find the problem. At this point, it is helpful to have an integration with the source code editor where users can see the source code while stepping through the traces.
These views are detailed, connected, and refined, letting the engineer filter, drill down, and zoom into the problems. It is hard for end users to build this kind of rich, nuanced user experience themselves. Instead, observability tools should build in those views and let the engineer customize them by selecting data feeds underlying the views. For example, it is common to draw the views above based on latency, but the tool can let users change the views to select a different measurement, such as lock wait time, to zoom into a specific problem.
Another relevant and interesting insight is what changed in the overall system before the particular problem occurred. Very often the errors are caused by something that changed in the system, such as a code update, a configuration change, dependency updates, or even a deployment change. As discussed by Bogomil, the ability to pull data from the continuous integration and continuous delivery (CI/CD) pipeline and active deployment can also greatly improve troubleshooting processes. For example, if the organization already uses GitOps, this is a matter of showing the difference between the two deployments.
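As a small illustration of showing the difference between two deployments, the sketch below compares two deployment descriptions represented as flat dictionaries. The setting names and values are hypothetical; in a GitOps setup the same information would usually come from diffing the committed manifests.

```python
def diff_deployments(previous, current):
    """Return {setting: (before, after)} for every setting that changed."""
    changes = {}
    for key in sorted(set(previous) | set(current)):
        before = previous.get(key)
        after = current.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Hypothetical deployment settings, for illustration only.
previous = {"image": "orders:1.4.2", "replicas": 3, "db_pool_size": 20}
current = {"image": "orders:1.5.0", "replicas": 3, "db_pool_size": 5}

for setting, (before, after) in diff_deployments(previous, current).items():
    print(f"{setting}: {before} -> {after}")
```

Surfacing this list next to the timeline of the incident lets the engineer check quickly whether a recent change lines up with the start of the problem.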
Finally, the observability system itself can help the engineer by pinpointing potential root causes and explaining the evidence. Most existing tools have at least one use case, but they work only for limited types of problems. Explainable AI, AI models that can explain how they arrive at the prediction, has made some progress. However, this is still an evolving field.
Closing the loop
Troubleshooting is an iterative process, where engineers look deeply at the data around the failures, create a hypothesis, design an experiment, and test it out. If the hypothesis fails, the new data will enable them to form a new hypothesis. The cycle repeats.
With complex, scattered microservices apps, this process needs to work hand in hand with the observability system. Once a hypothesis is formed, the engineer starts the debugging process, which involves inspecting runtime status for additional information or trying out fixes.
A deep integration of observability methods will significantly streamline the process. The following are potential uses of such a deep integration.
First, to ease the debugging process, engineers often want more detailed metrics or tracing. Thus, observability systems should let these users select certain areas of code for deep tracing or send watch statements to running code, which will circle back the results to observability views.
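Selective deep tracing can be approximated in application code with a switch that is consulted on every call, as in the sketch below. All names here (`DEEP_TRACE_TARGETS`, `TRACE_EVENTS`, `deep_traceable`, `apply_discount`) are hypothetical, and the in-memory list merely stands in for sending events to an observability backend.

```python
import functools
import time

DEEP_TRACE_TARGETS = set()  # names of functions currently selected for deep tracing
TRACE_EVENTS = []           # stand-in for the observability backend


def deep_traceable(func):
    """Record arguments and timing, but only while the function is selected."""
    name = func.__qualname__

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if name not in DEEP_TRACE_TARGETS:
            return func(*args, **kwargs)  # fast path: one set lookup of overhead
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            TRACE_EVENTS.append({
                "function": name,
                "args": args,
                "elapsed_s": time.perf_counter() - start,
            })

    return wrapper


@deep_traceable
def apply_discount(total):
    return total - 10 if total > 100 else total
```

An engineer investigating a hypothesis would add `"apply_discount"` to `DEEP_TRACE_TARGETS` at runtime, watch the detailed events arrive in the observability views, and remove it again once done.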
A close integration with development environments will let engineers open the source code with one click and see selected metrics overlaid on the code. They could make changes and push them to either a staging area or a small percentage of traffic to obtain feedback from the code overlay or observability views.
A microservices architecture scatters applications across many machines, making troubleshooting more difficult. Observability tools aim to lift the veil, present detailed information about system executions to users, and help them zoom in on any problems. While they provide some assistance today, improvements in the ability to collect and present the right data will be critical if observability tools are to become effective, mainstream solutions for troubleshooting microservices architectures.
- Bogomil Balkansky, The Future of Observability, https://medium.com/sequoia-capital/the-future-of-observability-6918caaa021
- Cindy Sridharan, Distributed Tracing — we’ve been doing it wrong, https://copyconstruct.medium.com/distributed-tracing-weve-been-doing-it-wrong-39fc92a857df
This article was originally published on InfoQ on February 10, 2021.