The Importance of Distributed Tracing and Monitoring in a Microservice Architecture

May 5, 2020

Microservices have many advantages, such as the ability to deploy decoupled services independently instead of redeploying an entire monolithic application, and to scale out individual components on their own. However, for all their advantages, microservices also introduce complexities that teams should understand before they implement them. One of those complexities is monitoring application health and tracing traffic flows through a distributed system.

Often microservices can logically be thought of as a single application, even though internally there may be several services involved in fulfilling each end-user request. If each request is routed through a series of services, how do teams trace and troubleshoot issues? Even if the system as a whole is functionally correct, the overall transaction time may still be unacceptable. How do teams diagnose latency between microservices or collate logs from dozens of loosely coupled services, while also keeping logging overhead to a minimum?

Logging, Monitoring, and Tracing

Before we continue, let us disambiguate logging, monitoring, and tracing. These terms are often used interchangeably, but there are some subtle differences.

Typically, logging is focused on diagnosing errors or providing auditing capabilities at the component (or microservice) level. Logging is also usually reactive in nature and is often the first tool that operators use when troubleshooting service errors or investigating security incidents.

Monitoring is focused on proactive metrics and thresholds that let operators know how a service is handling its requests. These metrics usually cover the underlying infrastructure, such as CPU, memory, and I/O, in addition to the runtime metrics generated by the application itself. Runtime metrics include statistics like heap size, thread count, and memory management (garbage collection).
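To make the idea of runtime metrics concrete, here is a minimal sketch of a snapshot function built only on the Python standard library. The function name and the metric keys are illustrative assumptions, not part of any monitoring product, and a real agent would also collect host-level CPU, memory, and I/O figures.

```python
import gc
import sys
import threading
import time


def runtime_metrics() -> dict:
    """Return a small snapshot of runtime metrics for the current process.

    The key names are hypothetical and chosen only for this example.
    """
    # gc.get_stats() returns one dict per GC generation, each with a
    # "collections" counter for completed collection runs.
    gc_runs = sum(stat["collections"] for stat in gc.get_stats())
    return {
        "timestamp": time.time(),
        # CPU seconds (user + system) consumed by this process so far.
        "process_cpu_seconds": time.process_time(),
        # Number of threads currently alive in this interpreter.
        "thread_count": threading.active_count(),
        # Memory blocks currently allocated by the interpreter (CPython).
        "allocated_blocks": sys.getallocatedblocks(),
        "gc_collections_total": gc_runs,
    }


if __name__ == "__main__":
    for name, value in runtime_metrics().items():
        print(f"{name}: {value}")
```

A monitoring agent would typically call a function like this on a fixed interval and compare the values against thresholds.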

Tracing is typically focused on optimization and performance across multiple services. It includes correlating a single “logical request” to multiple physical requests as they propagate through multiple services. While tracing is concerned with each service in a chain, it is usually focused at the application level. There are additional benefits to tracing as a troubleshooting tool since any exception or error is usually captured along with the entire context of a request.

While logging is important, in this article we will focus on monitoring and tracing since they span multiple services.

Monitoring

Monitoring is the process of recording information, or predefined metrics, so that operators gain visibility into their application's state. These metrics help answer questions about resource allocation and show whether requests are going where they should. This becomes increasingly important when traffic shifting and other advanced routing mechanisms like throttling or circuit breakers are used within microservice applications. Knowing when a request hit a service is important, but knowing why it was forwarded to that service can be just as important. Monitoring can also help operators determine when circuit breakers trip or reset, or which traffic shifting rule was invoked in routing the request. Traefik generates metrics around these types of Key Performance Indicators (KPIs) and exposes that data to metrics backends such as Prometheus.
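As a rough illustration of what such metrics look like on the wire, the sketch below counts requests per service and routing rule, tracks circuit breaker state, and renders both in the Prometheus text exposition format. The class, the metric names (demo_requests_total, demo_circuit_breaker_open), and the labels are assumptions made for this example. They are not the names Traefik actually exports.

```python
from collections import Counter


class Metrics:
    """Tiny in-memory metrics registry (illustrative only)."""

    def __init__(self) -> None:
        self.requests = Counter()  # (service, rule) -> request count
        self.breaker_open = {}     # service -> 1 if open, 0 if closed

    def record_request(self, service: str, rule: str) -> None:
        self.requests[(service, rule)] += 1

    def set_breaker(self, service: str, is_open: bool) -> None:
        self.breaker_open[service] = 1 if is_open else 0

    def render(self) -> str:
        """Render all metrics in the Prometheus text exposition format."""
        lines = [
            "# HELP demo_requests_total Requests routed to each service.",
            "# TYPE demo_requests_total counter",
        ]
        for (service, rule), count in sorted(self.requests.items()):
            lines.append(
                f'demo_requests_total{{service="{service}",rule="{rule}"}} {count}'
            )
        lines += [
            "# HELP demo_circuit_breaker_open 1 if the breaker is open.",
            "# TYPE demo_circuit_breaker_open gauge",
        ]
        for service, state in sorted(self.breaker_open.items()):
            lines.append(f'demo_circuit_breaker_open{{service="{service}"}} {state}')
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    metrics = Metrics()
    metrics.record_request("orders", "canary")
    metrics.set_breaker("orders", is_open=False)
    print(metrics.render(), end="")
```

Labeling each request with the rule that routed it is what lets an operator answer the "why" question, not only the "when" question.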

Tracing

Tracing starts at the entry point of a request into an application. A trace is started for the request and is assigned a unique identifier. As traffic flows from service to service, each service adds some information to the trace, such as the time the request arrived at the service and how long it took to process. This allows open source tools such as Jaeger and Elastic APM to visualize the entire call flow.
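The data model described above can be sketched in a few lines. This is a simplified, single-process illustration in which the Trace and Span classes and their fields are assumptions for the example. Real tracers such as Jaeger use richer span models and send the data to a collector.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Span:
    service: str      # which service handled this step
    arrived: float    # wall-clock arrival time (seconds since the epoch)
    duration: float   # processing time in seconds


@dataclass
class Trace:
    # One unique identifier is generated per logical request.
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)

    def record(self, service: str, work: Callable[[], object]) -> object:
        """Run `work` on behalf of `service` and append a span for it."""
        arrived = time.time()
        started = time.perf_counter()
        try:
            return work()
        finally:
            # The span is recorded even if `work` raises an exception.
            elapsed = time.perf_counter() - started
            self.spans.append(Span(service, arrived, elapsed))


if __name__ == "__main__":
    trace = Trace()
    trace.record("gateway", lambda: time.sleep(0.01))
    trace.record("orders", lambda: time.sleep(0.02))
    for span in trace.spans:
        print(trace.trace_id, span.service, f"{span.duration:.3f}s")
```

Because every span carries the same trace identifier, a visualization tool can line the spans up into a single call flow.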

How is tracing implemented in a microservices application? In most cases, there are libraries and tools to help instrument the most popular application runtimes and frameworks. You could also code it yourself, intercepting calls and adding headers to downstream requests or using some other mechanism to add metadata to traffic. In addition to these techniques, you can utilize tracing with Traefik and Maesh to gain additional observability within your application environments.
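If you did code it yourself, header propagation might look like the sketch below. It assumes the W3C Trace Context traceparent header (version, trace ID, parent span ID, and flags joined by hyphens), which is one possible convention and not necessarily the one your tracing backend uses. The function names are hypothetical, and the incoming headers are assumed to be a plain dict with lowercase keys.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace: 32 hex chars of trace ID, 16 of span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"  # "01" marks the request as sampled


def outgoing_headers(incoming: dict) -> dict:
    """Build headers for a downstream call.

    Keeps the caller's trace ID so that all hops share one trace, and
    replaces the parent span ID with a fresh ID for this hop.
    """
    parts = incoming.get("traceparent", "").split("-")
    if len(parts) != 4:
        # Missing or malformed header: this service is the entry point.
        return {"traceparent": new_traceparent()}
    version, trace_id, _caller_span_id, flags = parts
    return {"traceparent": f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"}


if __name__ == "__main__":
    first_hop = outgoing_headers({})
    second_hop = outgoing_headers(first_hop)
    print(first_hop["traceparent"])
    print(second_hop["traceparent"])
```

The important property is that the trace ID survives every hop while the span ID changes, which is what allows a backend to rebuild the parent-child relationships.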

Traefik supports tracing via OpenTracing, an open standard designed for distributed tracing. You enable tracing via configuration and can also specify which backend you want to utilize: Jaeger, Zipkin, or DataDog.

Tracing or Monitoring, or Both?

The distinction between monitoring and tracing is often academic. Some teams utilize monitoring and metrics alongside distributed tracing, while other teams prefer to keep tracing and monitoring as separate but complementary concerns. In many cases, the aggregate data provided by tracing gives teams the information they need to determine when and where to scale out their services.
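As a small example of turning trace data into aggregate information, the sketch below groups span durations by service and picks the service with the highest mean latency. The input format, a list of (service, duration in seconds) pairs, and the function names are assumptions for illustration. A real tracing backend would expose this kind of summary through its own query interface.

```python
from collections import defaultdict
from statistics import mean


def latency_by_service(spans):
    """Summarize (service, duration_seconds) pairs per service."""
    grouped = defaultdict(list)
    for service, duration in spans:
        grouped[service].append(duration)
    return {
        service: {
            "count": len(durations),
            "mean": mean(durations),
            "max": max(durations),
        }
        for service, durations in grouped.items()
    }


def slowest_service(spans):
    """Return the service with the highest mean span duration."""
    summary = latency_by_service(spans)
    return max(summary, key=lambda service: summary[service]["mean"])


if __name__ == "__main__":
    sample = [("auth", 0.01), ("orders", 0.20), ("orders", 0.40), ("auth", 0.03)]
    print(latency_by_service(sample))
    print("candidate for scaling out:", slowest_service(sample))
```

A high mean latency alone does not prove that a service needs more replicas, but it is a reasonable first signal for where to look.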

Monitoring is typically easier to implement, so teams usually start their diagnostic journey with metrics. Tracing allows a deeper (or wider) view but typically requires more effort to implement because of the work involved in collecting, storing, and analyzing the large volume of telemetry that tracing generates. Of course, there are vendors such as DataDog and Instana that can offload most of that work, but those solutions can be costly and hard to justify when first starting out.

Both monitoring and tracing can be used to help detect when individual services are not behaving as they should. Tracing, when properly implemented, will help you not only detect anomalies but also understand what may be causing them. Ultimately, when you reach the point where you need to focus on optimization and improve end-to-end performance, you will have little choice but to utilize tracing. Tools such as Traefik and Maesh can be used to introduce distributed tracing without the significant overhead of instrumenting every service with open source tools and managing all the additional telemetry yourself.

Conclusion

Both monitoring and tracing are important for creating stable, reliable, and performant microservice-based applications. Monitoring provides health checks on platform services and critical infrastructure, while tracing allows you to diagnose end-to-end traffic for requests. As applications mature, they typically require both monitoring and tracing for efficient service management.

Want to learn more about microservice architecture best practices? Check out this white paper that addresses production challenges (including tracing and monitoring) related to adopting microservices with a cloud-native mindset.