The outcome of the software engineering process is invisible, which makes it hard to understand progress and reason about the produced output. It becomes especially hard when developing systems with many distributed components.
Measurements of the state and actions of a system can help make the invisible visible. This is what the term “observability” is about.
Two important terms in observability are “tracing” and “telemetry”. Tracing allows engineers to follow individual execution paths through a system. Telemetry, on the other hand, is about collecting large amounts of data that only provide insight when combined.
Tracing is used in various ways, e.g. to understand why a system does not meet its performance requirements or where failures occur. It is an important part of the toolkit used by software engineers to monitor, debug, and optimise distributed systems.
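To make the idea of following an execution path more concrete, here is a minimal sketch in Python. It is not taken from the paper or from any particular tool: the `Span` fields, the sample `TRACE`, and the helper names are all assumptions chosen for illustration. It models a trace as a tree of timed spans and looks for the span that spends the most time in its own work.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One timed step of a request (field names are illustrative)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_ms(span: Span, spans: List[Span]) -> int:
    """Time spent in the span itself, excluding its direct children.

    Assumes children run one after another and do not overlap.
    """
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def slowest_span(spans: List[Span]) -> Span:
    """Return the span with the largest self time."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


def render(spans: List[Span], parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Render the trace as an indented call tree, ordered by start time."""
    lines = []
    children = sorted(
        (s for s in spans if s.parent_id == parent_id), key=lambda s: s.start_ms
    )
    for s in children:
        lines.append(f"{'  ' * depth}{s.service}:{s.operation} ({s.duration_ms} ms)")
        lines.extend(render(spans, s.span_id, depth + 1))
    return lines


# Hypothetical trace of a single request that crosses three services.
TRACE = [
    Span("t1", "a", None, "gateway", "GET /orders", 0, 120),
    Span("t1", "b", "a", "orders", "list_orders", 10, 110),
    Span("t1", "c", "b", "database", "SELECT orders", 20, 100),
]

print("\n".join(render(TRACE)))
print("bottleneck:", slowest_span(TRACE).service)
```

In this made-up trace the gateway and the orders service each spend 20 ms on their own work, while the database query takes 80 ms, so the database span is reported as the bottleneck.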
This paper describes the results of a systematic multivocal literature review (MLR), a type of review that includes both peer-reviewed and grey literature. The review considers the distinctive features, popularity, advantages, and issues of a large number of tracing tools that implement the OpenTracing API (which has since been replaced by OpenTelemetry).
The review covers a total of 30 tracing tools. Only 12 of these tools are fully open source. A few tools offer a free tier with limited features.
The table below shows the software license, supported programming languages, pricing model, and year of first release of each tracing tool. Programming languages marked with an asterisk support non-invasive instrumentation, which means that code can be modified automatically so that tracing information is sent to the tool.
| Tool | License | Supported languages | Pricing | First release |
|---|---|---|---|---|
| Appdash | MIT | Go*, Python*, Ruby* | Free | 2014 |
| ContainIQ | Proprietary | C/C++*, Go*, Rust*, Python*, Ruby*, Node.js* | Paid; Quote | 2021 |
| Dynatrace | Apache-2.0 | C++*, .NET*, Erlang*, Go*, Java*, Node.js*, Python*, Ruby*, Rust* | Paid | 2005 |
| ElasticAPM | Apache-2.0, BSD-2-Clause, BSD-3-Clause, Elastic-2.0, MIT | Go*, Python*, iOS*, Java*, Node.js*, PHP*, Ruby*, Gherkin | Paid | 2012 |
| Grafana Tempo | AGPL-3.0-only | Java*, Go*, .NET*, Python*, Node.js* | Free | 2020 |
| Haystack | Apache-2.0 | Java*, Node.js*, Python*, Go*, HCL, Shell, Smarty | Free | 2017 |
| Hypertrace | Traceable Community License Agreement (1.0) | Java*, Go*, Python*, Node.js*, C++*, .NET* | Free | 2020 |
| Jaeger | Apache-2.0 | Go*, Java*, Node.js*, Python*, C++*, C#* | Free | 2016 |
| Kamon | Apache-2.0 | Java*, Scala* | Free; Paid; Quote | 2017 |
| Lumigo | Apache-2.0 | Python*, Node.js*, Java, Go | Paid; Quote | 2018 |
| OpenCensus | Apache-2.0 | Python*, Node.js*, Go*, C#*, C++*, Erlang*, Java* | Free | 2017 |
| Splunk | Apache-2.0 | Python*, Java*, Node.js*, .NET*, Go*, Ruby*, PHP* | Paid | 2003 |
| Site24x7 | BSD-2-Clause, MIT | Java*, .NET*, Ruby*, PHP*, Node.js*, Python* | Paid | 2006 |
| Tanzu | Apache-2.0 | Java*, C++*, Go*, .NET*, Python*, Ruby | Free | 2019 |
| Uptrace | BSD-2-Clause, Apache-2.0 | Go*, Node.js*, .NET*, Ruby*, Python* | Paid | 2021 |
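The non-invasive instrumentation marked with an asterisk in the table can be illustrated with a small Python sketch. This is only one possible mechanism (wrapping functions at runtime) and is not how any of the listed tools is known to work; `traced`, `instrument`, `RECORDED`, and `OrderService` are hypothetical names, and the `RECORDED` list stands in for sending data to a tracing tool.

```python
import functools
import time
import types

# Stand-in for "sending tracing information to the tool" (assumption).
RECORDED = []


def traced(func):
    """Wrap a function so that every call records its name and duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            RECORDED.append((func.__name__, time.perf_counter() - start))
    return wrapper


def instrument(namespace):
    """Replace every public plain function of a class or module in place.

    The instrumented code itself is not edited, which is the point of
    the non-invasive approach.
    """
    for name, value in list(vars(namespace).items()):
        if isinstance(value, types.FunctionType) and not name.startswith("_"):
            setattr(namespace, name, traced(value))


class OrderService:
    """Ordinary application code that knows nothing about tracing."""

    def total(self, prices):
        return sum(prices)


instrument(OrderService)
result = OrderService().total([1, 2, 3])
print(result, [name for name, _ in RECORDED])
```

The application class is left untouched in the source; the single `instrument(OrderService)` call at start-up is what adds the tracing behaviour.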
Tracing tools can consist of several components:
- Libraries are used in source code to send data to an agent or directly to a collection component.
- Agents are responsible for collecting data for a particular context, e.g. an application, the operating system, or a database. They run as part of applications or as a separate component and forward data to collection components.
- Collectors persist data to a long-term storage component, like a time-series database. To improve performance, this can be done through a transport component that fulfils routing or caching tasks.
- Data processing components analyse incoming data and prepare it for usage in visualisations, dashboarding, and alerting.
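How the components described above might fit together can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the architecture of any specific tool: the class names, the batch size, and the in-memory dictionary used as "long-term storage" are all invented for the example.

```python
from collections import defaultdict


class Collector:
    """Persists incoming spans; a dict stands in for long-term storage."""

    def __init__(self):
        self.storage = defaultdict(list)

    def receive(self, batch):
        for span in batch:
            self.storage[span["trace_id"]].append(span)


class Agent:
    """Buffers spans for one application and forwards them in batches."""

    def __init__(self, collector, batch_size=2):
        self.collector = collector
        self.batch_size = batch_size
        self.buffer = []

    def report(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.collector.receive(self.buffer)
            self.buffer = []


class TracingLibrary:
    """Called from application code to emit spans to an agent."""

    def __init__(self, agent, service):
        self.agent = agent
        self.service = service

    def record(self, trace_id, operation, duration_ms):
        self.agent.report({
            "trace_id": trace_id,
            "service": self.service,
            "operation": operation,
            "duration_ms": duration_ms,
        })


def average_duration_per_service(collector):
    """Data processing step: aggregate stored spans for a dashboard."""
    durations = defaultdict(list)
    for spans in collector.storage.values():
        for span in spans:
            durations[span["service"]].append(span["duration_ms"])
    return {service: sum(v) / len(v) for service, v in durations.items()}


collector = Collector()
agent = Agent(collector)
library = TracingLibrary(agent, service="checkout")
library.record("t1", "charge_card", 40)
library.record("t1", "send_receipt", 20)
library.record("t2", "charge_card", 60)
agent.flush()  # forward whatever is still buffered
summary = average_duration_per_service(collector)
print(summary)
```

Keeping the agent between the library and the collector means the application only pays for an in-memory append per span, while batching and forwarding happen separately.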
In practice, most tracing tools only include some of these components. The table below shows which components are included with each tool:
The primary purpose of tracing tools is to collect data that allows users to see how a request traverses different services. However, many tools also collect metrics and logs, which can provide very helpful context when examining traces:
For interoperability, it is important not only that a tool supports as many programming languages as possible, but also that it has a documented API, supports OpenTelemetry, and can be self-hosted:
Only three tools have been cited by 10 or more papers: Zipkin (29), Jaeger (18), and LightStep (10). That doesn’t mean these are the most popular tools, however.
A search on technology-focused social media platforms reveals that it’s actually Splunk, Haystack, and Sentry that are the most popular, followed by New Relic and Datadog. The top 10 tools (which also include Zipkin, Jaeger, OpenTelemetry, Dynatrace, and AppDynamics) together account for over 90% of the social media coverage of tracing tools.
Not all social media coverage is positive. A sentiment analysis on online texts about tracing tools provides some insight into how much “appreciation” the community has for each of the 10 most popular tools:
| Tool | Positive (%) | Neutral (%) | Negative (%) |
Online discussion about tracing tools is often related to several topics:
- Architecture, e.g. ability to scale well in a microservice architecture
- Deployment & Integration, e.g. the ability to deploy tools in cloud-based and containerised infrastructures without downtime
- Development, e.g. the effect of the tool on DevOps and collaboration, resource usage, and troubleshooting
- Measurement, e.g. measuring the performance of microservice architectures via application metrics, and aggregation
- Tracing, e.g. real-time data, distributed tracing, error monitoring, incident notifications, and ability to identify performance bottlenecks
- Usability, e.g. enhancing developer productivity, downtime reduction, flexibility, security, reliability, and user experience in general
By assessing the sentiment for texts about each topic, we get an idea of the strengths and weaknesses of a tool. The table below summarises the topic sentiment for each tool. A complete overview can be found in the original article, in Tables 11 and 12.
| Deployment & Integration | ✅ | ✅ | ✅ |
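As a rough illustration of how sentiment per tool and topic might be tallied, the Python sketch below turns already-labelled texts into percentage shares. It does not reproduce the paper's method, which is not described here: the sentiment labels are assumed to be given, and `ToolA`, `ToolB`, and the sample rows are made-up placeholders rather than results from the review.

```python
from collections import Counter, defaultdict

LABELS = ("positive", "neutral", "negative")


def sentiment_shares(labelled_texts):
    """Return percentage shares of each label per (tool, topic) pair.

    `labelled_texts` is an iterable of (tool, topic, label) tuples.
    """
    counts = defaultdict(Counter)
    for tool, topic, label in labelled_texts:
        if label not in LABELS:
            raise ValueError(f"unknown sentiment label: {label!r}")
        counts[(tool, topic)][label] += 1

    shares = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        shares[key] = {
            label: round(100 * counter[label] / total, 1) for label in LABELS
        }
    return shares


# Hypothetical input; tool names and labels are placeholders.
SAMPLE = [
    ("ToolA", "Usability", "positive"),
    ("ToolA", "Usability", "positive"),
    ("ToolA", "Usability", "neutral"),
    ("ToolA", "Usability", "negative"),
    ("ToolB", "Tracing", "negative"),
]

shares = sentiment_shares(SAMPLE)
for (tool, topic), share in shares.items():
    print(tool, topic, share)
```

With only a handful of texts per pair, as for the placeholder `ToolB` here, the shares say very little, which is one reason to treat such summaries with caution.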
Note that based on these results one cannot conclude that any of these tools is clearly better than the rest. Different tools provide different features that suit teams with different preferences. Some might only consider self-hosted solutions, while others may prefer hosted solutions with commercial support. Moreover, the tool ideally needs to support the programming languages that are used by the team.
At the same time, there are still many aspects of tracing tools that have not been covered by this review, so make sure you always do your own research!
- There are 10 popular tracing tools that all have their own strengths and weaknesses, but there is no clear winner