4 Key Observability Metrics for Distributed Applications
A common architectural design pattern these days is to break up an application monolith into smaller microservices. Each microservice is then responsible for a specific aspect or feature of your app. For example, one microservice might be responsible for serving external API requests, while another might handle data fetching for your frontend.
Designing a robust and fail-safe infrastructure in this way can be challenging; monitoring the operations of all these microservices together can be even harder. It's best not to simply rely on your application logs for an understanding of your systems' successes and errors. Setting up proper monitoring will provide you with a more complete picture, but it can be difficult to know where to start. In this post, we'll cover service areas your metrics should focus on to ensure you're not missing key insights.
We're going to make a few assumptions about your app setup. Don't worry—you don't need to use any specific framework to start tracking metrics. However, it does help to have a general understanding of the components involved. In other words, how you set up your observability tooling matters less than what you track.
Since a sufficiently large set of microservices requires some level of coordination, we're going to assume you are using Kubernetes for orchestration. We're also assuming you have a time series database like Prometheus or InfluxDB for storing your metrics data. You might also need an ingress controller, such as the one Kong provides to control traffic flow, and a service mesh, such as Kuma, to better facilitate connections between services.
Before implementing any monitoring, it's essential to know how your services actually interact with one another. Writing out a document that identifies which services and features depend on one another and how availability issues would impact them can help you strategize around setting baseline numbers for what constitutes an appropriate threshold.
You should be able to see data points from two perspectives: Impact Data and Causal Data. Impact Data represents information that identifies who is being impacted. For example, if there's a service interruption and responses slow down, Impact Data can help identify what percentage of your active users is affected.
While Impact Data determines who is being affected, Causal Data identifies what is being affected and why. Kong Ingress, which can monitor network activity, can give us insight into Impact Data. Meanwhile, Kuma can collect and report Causal Data.
Let's look at a few data sources and explore the differences between Impact Data and Causal Data that can be collected about them.
Latency is the amount of time it takes between a user performing an action and its final result. For example, if a user adds an item to their shopping cart, the latency would measure the time between the item addition and the moment the user sees a response that indicates its successful addition. If the service responsible for fulfilling this action degraded, the latency would increase, and without an immediate response, the user might wonder whether the site was working at all.
To properly track latency in an Impact Data context, it's necessary to follow a single event throughout its entire lifetime. Sticking with our purchasing example, we might expect the full flow of an event to look like the following:
- The customer clicks the "Add to Cart" button
- The browser makes a server-side request, initiating the event
- The server accepts the request
- A database query ensures that the product is still in stock
- The database response is parsed, a response is sent to the user, and the event is complete
To successfully follow this sequence, you should standardize on a naming pattern that identifies both what is happening and when it's happening, such as customer_purchase.initiate, customer_purchase.queried, customer_purchase.finalized, and so on. Depending on your programming language, you might be able to provide a function block or lambda to the metrics service:
# Assumes a statsd-ruby-style client, whose `time` method accepts a block
statsd.time('customer_purchase.initiate') do
  # ...
end
By using specific names like these, you can home in on which segment of the event was slow when a latency issue occurs.
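To illustrate how stage-level names help, here is a minimal, self-contained sketch that does not talk to a real StatsD server. The StageTimer class and its in-memory storage are hypothetical stand-ins for a metrics client; only the customer_purchase.* naming pattern comes from the text above.

```ruby
# Hypothetical in-memory stand-in for a metrics client. A real client
# would send each duration to a metrics backend instead of a hash.
class StageTimer
  attr_reader :durations_ms

  def initialize
    @durations_ms = Hash.new { |hash, key| hash[key] = [] }
  end

  # Times the given block with a monotonic clock and records the
  # elapsed milliseconds under the stage name.
  def time(stage)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @durations_ms[stage] << elapsed * 1000.0
    result
  end

  # Returns the stage name with the largest total recorded time,
  # or nil when nothing has been recorded yet.
  def slowest_stage
    @durations_ms.max_by { |_stage, samples| samples.sum }&.first
  end
end

timer = StageTimer.new
timer.time('customer_purchase.initiate')  { :accepted }
timer.time('customer_purchase.queried')   { sleep 0.05 } # simulated slow query
timer.time('customer_purchase.finalized') { :responded }

puts timer.slowest_stage
```

Because every stage reports under its own name, a slow purchase flow can be attributed to one segment rather than to the request as a whole.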
Tracking latency in a Causal Data context requires you to track the speed of an event between services, not just the actions performed. In practice, this means timing service-to-service requests:
# Nested timers: the outer one covers the whole action, the inner one only the cross-service call
statsd.time('customer_purchase.initiate') do
  statsd.time('customer_purchase.external_database_query') do
    # ...
  end
end
This shouldn't be limited to capturing the overall endpoint request/response cycles. That sort of latency tracking is too broad and ought to be more granular. Suppose you have a microservice with an endpoint that makes internal database requests. In that case, you might want to time the moment the request was received, how long the query took, the moment the service sent its response, and the moment the originating client received that response. This way, you can pinpoint precisely where time is spent as the services communicate with one another.
You want your application to be useful and popular—but an influx of users can be too much of a good thing if you're not prepared! Changes in site traffic can be difficult to predict. You might be able to serve user load on a day-to-day basis, but events (both expected and unexpected) can have unanticipated consequences. Is your eCommerce site running a weekend promotion? Did your site go viral because of some unexpected praise? Traffic variances can also be affected by geolocation. Perhaps users in Japan are experiencing traffic load in a way that users in France are not. You might think that your systems are working as intended, but all it takes is a massive influx of users to test that belief. If an event takes 200ms to complete, but your system can only process one event at a time, it might not seem like there's a problem—until the event queue is suddenly clogged up with work.
Similar to latency, it's useful to track the number of events being processed throughout the event's lifecycle to get a sense of any bottlenecks. For example, tracking the number of jobs in a queue, the number of HTTP requests completed per second, and the number of active users are good starting points for monitoring traffic.
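The 200ms example can be worked through with a little arithmetic. The sketch below is only an illustration under simplifying assumptions (a single worker, a fixed service time, a constant arrival rate); the function name and the sample numbers are made up for this example.

```ruby
# Returns the queue depth after each second, assuming a single worker,
# a fixed service time, and a constant arrival rate.
def queue_depth_over_time(arrivals_per_second:, service_time_ms:, seconds:)
  capacity_per_second = 1000 / service_time_ms # whole events per second
  depth = 0
  (1..seconds).map do
    # Work that cannot be processed this second stays in the queue.
    depth = [depth + arrivals_per_second - capacity_per_second, 0].max
  end
end

normal = queue_depth_over_time(arrivals_per_second: 4, service_time_ms: 200, seconds: 5)
spike  = queue_depth_over_time(arrivals_per_second: 8, service_time_ms: 200, seconds: 5)

puts normal.inspect # queue stays empty
puts spike.inspect  # queue grows by 3 events every second
```

At 200ms per event, one worker handles at most five events per second, so any sustained arrival rate above that shows up as a steadily growing queue depth, which is why queue length is worth tracking alongside request counts.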
For Causal Data, monitoring traffic involves capturing how services transmit information to one another, similar to how we did it for latency. Your monitoring setup ought to track the number of requests to specific services, their response codes, their payload sizes, and so on—as much about the request and response cycle as necessary. When you need to investigate worsening performance, knowing which service is experiencing problems will help you track the possible source much sooner.
Tracking error rates is rather straightforward. Any 5xx (or even 4xx) issued as an HTTP response by your server should be tagged and counted. Even situations that you've accounted for, such as caught exceptions, should be monitored because they still represent a non-ideal state. These issues can act as warnings for deeper problems stemming from defensive coding that doesn't address actual problems.
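As a rough sketch of what "tagged and counted" might look like, the snippet below groups response codes by class and reports each class as a share of all responses. The function name and the sample status codes are invented for illustration; in practice the counts would come from your metrics store.

```ruby
# Groups HTTP status codes into classes (2xx, 4xx, 5xx, ...) and
# reports each class as a fraction of all responses.
def status_class_rates(status_codes)
  total = status_codes.size.to_f
  return {} if total.zero?

  status_codes
    .group_by { |code| "#{code / 100}xx" }
    .transform_values { |codes| codes.size / total }
end

responses = [200, 200, 201, 404, 500, 200, 503, 200, 200, 200]
rates = status_class_rates(responses)

puts rates
```

Keeping 4xx and 5xx as separate tags makes it possible to alert on server-side failures without losing sight of a rising share of client errors.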
Kuma can capture the error codes and messages thrown by your service, but this represents only a portion of actionable data. For example, you can also capture the arguments which caused the error (in case a query was malformed), the database query issued (in case it timed out), the permissions of the acting user (in case they made an unauthorized attempt), and so on. In short, capturing the state of your service at the moment it produces an error can help you replicate the issue in your development and testing environments.
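One way to capture that state is to attach a context hash to each error before reporting it. The sketch below is an assumption about how this could be structured, not a Kuma feature: the error_event helper, its field names, and the sample values are all hypothetical.

```ruby
require 'json'
require 'time'

# Builds a structured error event that pairs the exception with the
# state that was in play when it was raised (hypothetical field names).
def error_event(error, context)
  {
    error_class: error.class.name,
    message: error.message,
    occurred_at: Time.now.utc.iso8601,
    context: context
  }
end

begin
  raise ArgumentError, 'quantity must be positive'
rescue ArgumentError => e
  event = error_event(
    e,
    { arguments: { sku: 'ABC-123', quantity: -1 }, user_role: 'guest' }
  )
  puts JSON.generate(event)
end
```

Be careful about what goes into the context: arguments and queries can contain personal or sensitive data that should be redacted before it is stored.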
You should track the memory usage, CPU utilization, disk reads/writes, and available storage of each of your microservices. If your resource usage regularly spikes during certain hours or operations or increases at a steady rate, this suggests you’re overutilizing your server. While your server may be running as expected, once again, an influx of traffic or other unforeseen occurrences can quickly topple it over.
Kong Ingress only monitors network activity, so it's not ideal for tracking saturation. However, there are many tools available for tracking this with Kubernetes.
Up to now, we've discussed the kinds of metrics that will be important to track in your cloud application. Next, let’s dive into some specific steps you can take to implement this monitoring and observability.
Prometheus is the go-to open-source monitoring system, and it is easy to install and integrate with your Kubernetes setup. Installation is especially simple if you use Helm.
First, we create a monitoring namespace:
$ kubectl create namespace monitoring
Next, we use Helm to install Prometheus. Before installing, we add the required chart repositories to Helm:
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo add stable https://charts.helm.sh/stable
$ helm repo update
$ helm install -f https://bit.ly/2RgzDtg -n monitoring prometheus prometheus-community/prometheus
The values file referenced at https://bit.ly/2RgzDtg sets the data scrape interval for Prometheus to ten seconds.
Assuming you are using Kong Ingress Controller (KIC) for Kubernetes, your next step will be to create a custom resource, a KongClusterPlugin resource, which integrates into the KIC. Create a file called prometheus-plugin.yml:
apiVersion: configuration.konghq.com/v1
kind: KongClusterPlugin
metadata:
  name: prometheus
  annotations:
    kubernetes.io/ingress.class: kong
  labels:
    global: "true"
plugin: prometheus
Grafana is an observability platform that provides excellent dashboards for visualization of data scraped by Prometheus. We use Helm to install Grafana as follows:
$ helm install grafana stable/grafana -n monitoring --values http://bit.ly/2FuFVfV
You can view the bit.ly URL in the above command to see the specific configuration values for Grafana that we provide upon installation.
Now that Prometheus and Grafana are up and running in our Kubernetes cluster, we'll need access to their dashboards. For this article, we'll set up basic port forwarding to expose those services. This is a simple way to get access, but it is not very secure and is not advisable for production deployments.
$ POD_NAME=$(kubectl get pods --namespace monitoring -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
$ kubectl --namespace monitoring port-forward $POD_NAME 9090 &
$ POD_NAME=$(kubectl get pods --namespace monitoring -l "app.kubernetes.io/instance=grafana" -o jsonpath="{.items[0].metadata.name}")
$ kubectl --namespace monitoring port-forward $POD_NAME 3000 &
The above commands expose the Prometheus server on port 9090 and the Grafana dashboard on port 3000.
Those simple steps should be sufficient to set you off and running. With Kong Ingress Controller and its integrated Prometheus plugin, capturing metrics with Prometheus and visualizing them with Grafana are quick and simple to set up.
Whenever you need to investigate worsening performance, your Impact Data metrics can help orient you on the magnitude of the problem: they should tell you how many people are affected. Likewise, your Causal Data identifies what isn't working and why. The former points you to the plume of smoke, and the latter takes you to the fire.
In addition to all of the above, you should also consider the rate at which your metrics are changing. For example, say your traffic numbers are increasing. Observing how quickly those numbers are moving can help you determine when (or if) it'll become a problem. This is essential for managing upcoming work with regular deployments and changes to your services. It also establishes what an ideal performance metric should be.
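A simple way to reason about rate of change is to extrapolate the current trend toward a limit you care about. The sketch below assumes evenly spaced samples and a linear trend, which real traffic rarely follows exactly; the function names, the readings, and the limit of 2000 are made up for illustration.

```ruby
# Average change per interval between the first and last sample.
def rate_of_change(samples)
  return 0.0 if samples.size < 2

  (samples.last - samples.first).to_f / (samples.size - 1)
end

# Intervals until the metric reaches `limit` if the current trend holds,
# or nil when the metric is flat or falling.
def intervals_until(limit, samples)
  rate = rate_of_change(samples)
  return nil unless rate.positive?

  [((limit - samples.last) / rate).ceil, 0].max
end

requests_per_minute = [1000, 1100, 1250, 1300, 1400] # made-up hourly readings

puts rate_of_change(requests_per_minute)        # average increase per reading
puts intervals_until(2000, requests_per_minute) # readings left before the limit
```

With these numbers, traffic grows by an average of 100 requests per minute between readings, so a limit of 2000 would be reached about six readings later if nothing changes.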