OpenTelemetry in 2023

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • tempo

    Grafana Tempo is a high volume, minimal dependency distributed tracing backend.

  • > It's easy to add Jaeger to your local dev stack so you can have tracing while developing.

    Tempo can be spun up with docker compose using a local disk for ephemeral storage/querying: https://github.com/grafana/tempo/blob/main/example/docker-co...

    Maybe this meets your needs?

    > Jaeger is easier to setup/manage and has a better interface than Grafana/Tempo

    What do you enjoy about the Jaeger interface? Perhaps it's a gap in Tempo we can improve.

  • signoz

    SigNoz is an open-source observability platform native to OpenTelemetry with logs, traces and metrics in a single application. An open-source alternative to DataDog, NewRelic, etc. 🔥 🖥. 👉 Open source Application Performance Monitoring (APM) & Observability tool

  • Thanks for mentioning SigNoz, I am one of the maintainers at SigNoz and would love your feedback on how we can improve it further.

    If anyone wants to check our project, here's our GitHub repo - https://github.com/SigNoz/signoz

  • oteps

    OpenTelemetry Enhancement Proposals

  • Oh nice, thank you (and also solumos) for the links! It looks like oteps/pull/171 (merged June 2023) expanded and superseded the opentelemetry-proto/pull/346 PR (closed Jul 2022) [0]. The former resulted in merging OpenTelemetry Enhancement Proposal 156 [1], with some interesting results especially for 'Phase 2' where they implemented columnar storage end-to-end (see the Validation section [2]):

    * For univariate time series, OTel Arrow is 2 to 2.5 times better in terms of bandwidth reduction ... and the end-to-end speed is 3.1 to 11.2 times faster

    * For multivariate time series, OTel Arrow is 3 to 7 times better in terms of bandwidth reduction ... Phase 2 has [not yet] been ... estimated but similar results are expected.

    * For logs, OTel Arrow is 1.6 to 2 times better in terms of bandwidth reduction ... and the end-to-end speed is 2.3 to 4.86 times faster

    * For traces, OTel Arrow is 1.7 to 2.8 times better in terms of bandwidth reduction ... and the end-to-end speed is 3.37 to 6.16 times faster

    [0]: https://github.com/open-telemetry/opentelemetry-proto/pull/3...

    [1]: https://github.com/open-telemetry/oteps/blob/main/text/0156-...

    [2]: https://github.com/open-telemetry/oteps/blob/main/text/0156-...
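
To make "columnar" concrete, here is a minimal sketch of the general idea, not the OTel Arrow encoding itself. The record fields and values are made up for illustration. It transposes row-oriented data points into one list per field, so that similar values sit next to each other before compression, then prints the compressed size of each layout so you can compare them on your own data.

```python
import json
import zlib

# Hypothetical row-oriented records: one dict per data point.
rows = [
    {
        "metric": "http.server.duration",
        "host": f"web-{i % 3}",
        "ts": 1_700_000_000 + i,
        "value": 0.25 * (i % 4),
    }
    for i in range(1000)
]


def to_columns(records):
    """Transpose a list of uniform dicts into one list per field."""
    return {key: [r[key] for r in records] for key in records[0]}


def to_rows(columns):
    """Inverse of to_columns: rebuild one dict per data point."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


columns = to_columns(rows)

# Compress both layouts with the same general-purpose compressor.
row_bytes = zlib.compress(json.dumps(rows).encode())
col_bytes = zlib.compress(json.dumps(columns).encode())
print("row-oriented:", len(row_bytes), "bytes")
print("column-oriented:", len(col_bytes), "bytes")
```

The transposition is lossless (`to_rows(to_columns(rows))` gives back the original records). How much it helps depends on the data and the compressor, which is why the figures quoted above are given as ranges.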

  • opentelemetry-proto

    OpenTelemetry protocol (OTLP) specification and Protobuf definitions

  • opentelemetry-go

    OpenTelemetry Go API and SDK

  • https://opentelemetry.io

    > OpenTelemetry is a collection of APIs, SDKs, and tools. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software's performance and behavior.

    You can absolutely categorize telemetry into these high-level categories, true. But the specifics of how that data is captured, exported, collected, queried, etc. are necessarily unique to each programming language, backend system, organization, etc.

    That's because telemetry data is always larger than the original data it represents: a production request will be of some well-defined size, but the metadata about that request is potentially infinite. Consequently, the main design constraint for telemetry systems is always efficiency.

    Efficiency requires specialization, which is in direct tension with features that generalize over backends and tools, e.g.

    > Traces, Metrics, Logs -- Create and collect telemetry data from your services and software, then forward them to a variety of analysis tools.

    and features that generalize over languages, e.g.

    > Drop-In Instrumentation -- OpenTelemetry integrates with popular libraries and frameworks such as Spring, ASP.NET Core, Express, Quarkus, and more! Installation and integration can be as simple as a few lines of code.

    I think OTel treats these goals -- which are very valuable to end users!! -- as inviolable core requirements, and then does whatever is necessary to implement them, even if the resulting code is unsound, or inefficient, or incoherent.

  • Grafana

    The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.

  • Grafana seems to be an option? Handles metrics, logs and traces. I don't know what storage costs look like though if you are self-hosting. https://grafana.com/

  • terraform-aws-jaeger

    Terraform module for Jaeger

  • It's really not that intense. I basically set up my last co's telemetry infrastructure all by myself, using terraform, otel-python, jaeger, and AWS elasticsearch.

    This TF project does most of the heavy lifting: https://github.com/telia-oss/terraform-aws-jaeger

  • openobserve

    🚀 10x easier, 🚀 140x lower storage cost, 🚀 high performance, 🚀 petabyte scale - Elasticsearch/Splunk/Datadog alternative for 🚀 (logs, metrics, traces, RUM, Error tracking, Session replay).

  • I guess you could take a look at this: https://openobserve.ai/

    It's in Rust to add some HN catnip.

  • opentelemetry-js

    OpenTelemetry JavaScript Client

  • > OpenTelemetry is a marketing-driven project, designed by committee, implemented naively and inefficiently, and guided by the primary goal of allowing Fortune X00 CTOs to tick off some boxes on their strategy roadmap documents.

    I'm the founder of highlight.io. On the consumer side as a company, we've seen a lot of value from OTEL; we've used it to build out language support for quite a few customers at this point, and the community is very receptive.

    Here's an example of us putting up a change: https://github.com/open-telemetry/opentelemetry-js/pull/4049

    Do you mind sharing why you think no-one should be using it? Some reasoning would be nice.

  • opentelemetry-js-contrib

    OpenTelemetry instrumentation for JavaScript modules

  • [2] https://github.com/open-telemetry/opentelemetry-js-contrib/t...

  • jaeger-tempo

    Discontinued Tempo Proxy with Jaeger

  • Jaeger can use multiple backends for storage, including Tempo, so it's not an either/or situation.

    I'm fairly sure there was an official Grafana-provided Jaeger gRPC plugin for Tempo, but can't easily find it, only this one: https://github.com/flitnetics/jaeger-tempo

  • proposal-async-context

    Async Context for JavaScript

  • You can follow [0] which is currently stage 2 to fix this

    [0]: https://github.com/tc39/proposal-async-context
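
For readers unfamiliar with what an async context provides, here is a minimal sketch using Python's standard `contextvars` module as an analogue. It is not the TC39 proposal's API, and the variable and function names are made up. The point is that a trace ID set at the start of a request can be read deep inside later `await` calls without being passed as a parameter.

```python
import asyncio
import contextvars

# Hypothetical name: holds the trace ID of whichever request is running.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


async def query_database():
    # No trace ID parameter: the value is read from the ambient context.
    await asyncio.sleep(0)
    return current_trace_id.get()


async def handle_request(trace_id):
    current_trace_id.set(trace_id)
    return await query_database()


async def main():
    # gather() wraps each coroutine in a task, and each task runs in its own
    # copy of the context, so concurrent requests do not overwrite each other.
    return await asyncio.gather(
        handle_request("trace-a"),
        handle_request("trace-b"),
    )


results = asyncio.run(main())
print(results)
```

Each request sees only its own trace ID even though both run interleaved on the same event loop, which is the property tracing libraries need from the runtime.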

  • VictoriaMetrics

    VictoriaMetrics: fast, cost-effective monitoring solution and time series database

  • You shouldn't unless you want to use the new open source standard for telemetry. You won't benefit from simplicity or performance improvements. It would be quite the opposite. You can check what is the actual cost of open telemetry adoption here [0]

    But if you ever decide to go this path - VictoriaMetrics supports OpenTelemetry protocol for metrics [1]

    [0] https://github.com/VictoriaMetrics/VictoriaMetrics/pull/2570

    [1] https://docs.victoriametrics.com/Single-server-VictoriaMetri...

  • self-hosted

    Sentry, feature-complete and packaged up for low-volume deployments and proofs-of-concept

  • > What should people use?

    I recall Apache Skywalking being pretty good, especially for smaller/medium scale projects: https://skywalking.apache.org/

    The architecture is simple, the performance is adequate, it doesn't make you spend days configuring it and it even supports various different data stores: https://skywalking.apache.org/docs/main/v9.0.0/en/setup/back...

    The problems with it are that it isn't super popular (although has agents for most popular stacks), the docs could be slightly better and I recall them also working on a new UI so there is a little bit of churn: https://skywalking.apache.org/downloads/

    Still better versus some of the other options when you need something that just works instead of spending a lot of time configuring something (even when that something might be superior in regards to the features): https://github.com/getsentry/self-hosted/blob/master/docker-...

    Sentry is just the first thing that comes to mind (OpenTelemetry also isn't simpler due to how much it tries to do), but compare its complexity to Skywalking: https://github.com/apache/skywalking/blob/master/docker/dock...

    I wish there was more self-hosted software like that out there, enough to address certain concerns in a simple way on day 1 and leave branching out to more complex options like OpenTelemetry once you have a separate team for that and the cash is rolling in.

  • skywalking

    APM, Application Performance Monitoring System

  • aws-otel-lambda

    AWS Distro for OpenTelemetry - AWS Lambda

  • OpenTelemetry is being pushed as a replacement for AWS X-Ray SDKs by AWS, but it's in such a broken state for Lambda right now. A 200-500% performance penalty for using it is insane[1][2].

    [1]: https://github.com/aws-observability/aws-otel-lambda/issues/...

  • opentelemetry-lambda

    Create your own Lambda Layer in each OTel language using this starter code. Add the Lambda Layer to your Lambda Function to get tracing with OpenTelemetry.

  • opentelemetry-specification

    Specifications for OpenTelemetry

  • Two problems with OpenTelemetry:

    1. It doesn't know what the hell it is. Is it a semantic standard? Is it a protocol? Is it a facade? What layer of abstraction does it provide? Answer: All of the above! All the things! All the layers!

    2. No one from OpenTelemetry has actually tried instrumenting a library. And if they have, they haven't the first suggestion on how instrumenters should actually use metrics, traces, and logs. Do you write to all three? To one? I asked this question two years ago, not a single response. [1]

    [1] https://github.com/open-telemetry/opentelemetry-specificatio...

  • community

    OpenTelemetry community content (by open-telemetry)

  • 1. Agreed. It's the sink and the house attached to it, and the docs are thin and confusing as a result.

    2. I had a similar experience to you. I wanted to implement a simple heartbeat in our app to get an idea of usage numbers. This is surprisingly not possible, which greatly confuses me given the name of the project. The low engagement on my question put me off and I abandoned my OpenTelemetry planning completely. [1][2]

    [1] https://github.com/open-telemetry/community/discussions/1598

  • semantic-conventions

    Defines standards for generating consistent, accessible telemetry across a variety of domains

  • [2] https://github.com/open-telemetry/semantic-conventions/issue...

  • proposal-explicit-resource-management

    ECMAScript Explicit Resource Management

  • In addition to this, there is the new (stage 3 even!) explicit resource management proposal [0], supported by TypeScript version >= 5.2 [1]

    Though I agree that async context is a better fit for this generally, the RMP should be good for telemetry around objects that have defined lifetime semantics, which is a step in the right direction you can use today

    [0]: https://github.com/tc39/proposal-explicit-resource-managemen...

    [1]: https://www.totaltypescript.com/typescript-5-2-new-keyword-u...
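
As an illustration of "telemetry around objects that have defined lifetime semantics", here is a minimal sketch using Python's `with` statement as an analogue of scope-bound cleanup. It is not the ECMAScript `using` syntax, and `span` and `finished_spans` are hypothetical names. The span is closed and recorded when the block exits, whether it exits normally or by raising.

```python
import time
from contextlib import contextmanager

# Stand-in for an exporter; a real system would ship these somewhere.
finished_spans = []


@contextmanager
def span(name):
    """Record how long the enclosed block ran, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        finished_spans.append(
            {"name": name, "duration_s": time.perf_counter() - start}
        )


with span("load-config"):
    time.sleep(0.01)

try:
    with span("failing-step"):
        raise ValueError("boom")
except ValueError:
    pass

print([s["name"] for s in finished_spans])
```

Tying the end of the span to the end of the scope removes the usual failure mode of manual instrumentation, where an early return or exception skips the call that closes the span.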

  • veneur

    A distributed, fault-tolerant pipeline for observability data

  • This was the idea behind Stripe's Veneur project - spans, logs, and metrics all in the same format, "automatically" rolling up cardinality as needed - which I thought was cool but also that it would be very hard to get non-SRE developers on board with when I saw a talk about it a few years ago.

    https://github.com/stripe/veneur

  • odigos

    Distributed tracing without code changes. 🚀 Instantly monitor any application using OpenTelemetry and eBPF

  • Disclaimer: I am one of the maintainers

    Many comments complain about the complexity of using OpenTelemetry. I recommend checking out Odigos, an open-source project which makes working with OpenTelemetry much easier: https://github.com/keyval-dev/odigos

    We combine OpenTelemetry and eBPF to instantly generate distributed traces without any code changes.

  • b3-propagation

    Repository that describes and sometimes implements B3 propagation

  • I've been playing with OTEL for a while, with a few backends like Jaeger and Zipkin, and am trying to figure out a way to perform end to end timing measurements across a graph of services triggered by any of several events.

    Consider this scenario: There is a collection of services that talk to one another, and not all use HTTP. Say agent A0 makes a connection to agent A1, this is observed by service S0 which triggers service S1 to make calls to S2 and S3, which propagate elsewhere and return answers.

    If we limit the scope of this problem to services explicitly making HTTP calls to other services, we can easily use the Propagators API [1] and use X-B3 headers [2] to propagate the trace context (trace ID, span ID, parent span ID) across this graph, from the origin through to the destination and back. This allows me to query the metrics collector (Jaeger or Zipkin) using this trace ID, look at the timestamps originating at the various services and do a T_end - T_start to determine the overall time taken by one call for a round trip across all the related services.

    However, this breaks when a subset of these functions cannot propagate the B3 trace IDs for various reasons (e.g., a service is watching a specific state and acts when the state changes). I've been looking into OTEL and other related non-OTEL ways to capture metrics, but it appears there's not much research into this area though it does not seem like a unique or new problem.

    Has anyone here looked at this scenario, and have you had any luck with OTEL or other mechanisms to get results?

    [1] https://opentelemetry.io/docs/specs/otel/context/api-propaga...

    [2] https://github.com/openzipkin/b3-propagation

    [3] https://www.w3.org/TR/trace-context/
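
The HTTP-only case described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the OpenTelemetry Propagators API: headers are a plain dict with exact-case keys, the receiver treats the incoming span ID as the parent of a new span, and the span records and timestamps are made up. The header names are the B3 multi-header names from [2].

```python
import secrets


def new_id(num_bytes=8):
    """Return a random lower-hex ID (8 bytes gives 16 hex characters)."""
    return secrets.token_hex(num_bytes)


def inject_b3(headers, trace_id, span_id, parent_span_id=None, sampled=True):
    """Write B3 multi-header trace context into an outgoing headers dict."""
    headers["X-B3-TraceId"] = trace_id
    headers["X-B3-SpanId"] = span_id
    if parent_span_id is not None:
        headers["X-B3-ParentSpanId"] = parent_span_id
    headers["X-B3-Sampled"] = "1" if sampled else "0"
    return headers


def extract_b3(headers):
    """Read the context on the receiving side and start a child span."""
    return {
        "trace_id": headers["X-B3-TraceId"],
        "span_id": new_id(),
        "parent_span_id": headers["X-B3-SpanId"],
    }


def end_to_end_micros(spans, trace_id):
    """Compute T_end - T_start over every span sharing the trace ID."""
    matching = [s for s in spans if s["trace_id"] == trace_id]
    start = min(s["start_us"] for s in matching)
    end = max(s["start_us"] + s["duration_us"] for s in matching)
    return end - start


# S1 calls S2, which calls S3. Timestamps are made-up microsecond values.
trace_id = new_id(16)
s1 = {"trace_id": trace_id, "span_id": new_id(), "parent_span_id": None}
s2 = extract_b3(inject_b3({}, s1["trace_id"], s1["span_id"]))
s3 = extract_b3(
    inject_b3({}, s2["trace_id"], s2["span_id"], s2["parent_span_id"])
)

recorded = [
    {**s1, "start_us": 1_000, "duration_us": 9_000},
    {**s2, "start_us": 2_000, "duration_us": 5_000},
    {**s3, "start_us": 3_000, "duration_us": 2_500},
]
total = end_to_end_micros(recorded, trace_id)
print(total)
```

The calculation only works because every hop copies the same trace ID forward. A service that reacts to a state change rather than to an incoming request has no headers to extract from, which is exactly where the approach breaks down.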

  • trace-context-w3c

    The purpose of W3C Trace Context and the kind of problem it came to solve.

  • zipkin-api-example

    Example of how to use the OpenApi/Swagger api spec

  • Yes, I really agree, and I've gone through the same pain, but try using the alternatives that claim to be better because they have OpenAPI specifications [1]

    The example shows you how to use the swagger tool, parse the OpenAPI spec [2], auto-generate GoLang glue code, call __one__ of those auto-generated functions and log a trace.

    However, there is zero documentation, zero other examples, and I'm left scratching my head whether there's even one person in the world using this approach. I eventually ended up just directly using the service APIs [3] via REST calls.

    OTEL is painful, but the alternatives are no better :( I really wish there were more interest in this space, since SLO and SLI measurements are becoming increasingly important.

    [1] https://github.com/openzipkin/zipkin-api-example

    [2] https://github.com/openzipkin/zipkin-api/blob/master/zipkin2...

    [3] https://zipkin.io/zipkin-api/#/
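
For the "directly using the service APIs via REST calls" route, a minimal sketch with only the Python standard library might look like this. It assumes a Zipkin instance on its default local port and uses the v2 JSON span fields; the service and span names are made up, and the actual send is left commented out.

```python
import json
import secrets
import time
import urllib.request

# Assumption: a local Zipkin listening on its default port.
ZIPKIN_URL = "http://localhost:9411/api/v2/spans"


def make_span(name, service, start_us, duration_us,
              trace_id=None, parent_id=None):
    """Build one span as a dict in Zipkin v2 JSON shape."""
    span = {
        "traceId": trace_id or secrets.token_hex(16),
        "id": secrets.token_hex(8),
        "name": name,
        "timestamp": start_us,    # epoch microseconds
        "duration": duration_us,  # microseconds
        "localEndpoint": {"serviceName": service},
    }
    if parent_id:
        span["parentId"] = parent_id
    return span


def build_request(spans, url=ZIPKIN_URL):
    """Prepare the POST request carrying a JSON array of spans."""
    body = json.dumps(spans).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


now_us = int(time.time() * 1_000_000)
root = make_span("checkout", "shop-frontend", now_us, 12_000)
child = make_span(
    "charge-card", "payments", now_us + 1_000, 8_000,
    trace_id=root["traceId"], parent_id=root["id"],
)
request = build_request([root, child])
# urllib.request.urlopen(request)  # uncomment with Zipkin running
```

Skipping the generated client keeps the dependency surface at zero, at the cost of having to track the span schema by hand.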

  • zipkin-api

    Zipkin's language independent model and HTTP Api Definitions

  • docs

    Prometheus documentation: content and static site generator (by prometheus)

  • The Prometheus text exposition format [1] is the de facto standard used in monitoring. It would be great to build an official observability standard on top of it. This format is much easier to debug and understand than OpenTelemetry for metrics. It is also more efficient, e.g. it requires less network bandwidth and less CPU for transfer than OTel for metrics.

    [1] https://github.com/prometheus/docs/blob/main/content/docs/in...
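
To show why the format is considered easy to debug, here is a minimal sketch that renders one metric family as plain text. The metric name, labels, and values are made up, and the sketch covers only the basic line shapes (`# HELP`, `# TYPE`, and samples with optional labels); it does not escape help text or handle timestamps.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline in a label value."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family; samples is a list of (labels, value)."""
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in labels.items()
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests handled.",
    [
        ({"method": "GET", "code": "200"}, 1027),
        ({"method": "POST", "code": "500"}, 3),
    ],
)
print(text)
```

Each sample is one self-describing line, so the output of a scrape endpoint can be read or grepped without any tooling.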
