I see a lot of people talk about Prometheus on here and speak about it as though it's the only metrics gathering solution. In this way, it really does seem like it has become the poster child of Hacker News and metrics gathering.
I've used both Prometheus and the Telegraf / StatsD solutions, and for a very long time I've disliked a lot about Prometheus, from the standard "bugs"[0] to the entire design philosophy of pull, as opposed to the push methodology of Telegraf and similar tools.
What is the collective's general stance on Prometheus vs Telegraf; and why does the collective tend to end up preferring one over the other?
[0] For example, Prometheus clients tend not to consider a counter that hasn't been incremented to exist, so if you have an error counter, the sudden appearance of that counter is how you find an error. The 'increase' is 0, though, because the series went from not existing to a value of 1. Citation: https://github.com/prometheus/prometheus/issues/1673
No, it's not technically a "bug", it's how it's designed; but, it speaks to how it's used and the work-arounds are unsatisfactory, in my opinion.
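To make the complaint in [0] concrete, here is a rough sketch in plain Python. `naive_increase` is a hypothetical stand-in I made up for illustration: it just subtracts the first sample in the window from the last one, and skips the extrapolation and counter-reset handling that the real PromQL `increase()` does. The sample data is invented too.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp, counter value)


def naive_increase(samples: List[Sample]) -> Optional[float]:
    """Hypothetical, simplified model of increase(): last value minus first.

    Ignores extrapolation and counter resets. Returns None when the window
    holds fewer than two samples, since there is nothing to compare.
    """
    if len(samples) < 2:
        return None
    return samples[-1][1] - samples[0][1]


# Series that only starts existing once the first error happens:
# the first scraped value is already 1.
lazy = [(60.0, 1.0), (75.0, 1.0), (90.0, 1.0)]

# Series exported at 0 from process start, then incremented once.
eager = [(30.0, 0.0), (45.0, 0.0), (60.0, 1.0), (75.0, 1.0)]

lazy_result = naive_increase(lazy)    # first and last value are both 1
eager_result = naive_increase(eager)  # goes from 0 to 1 inside the window
```

In this toy model the lazily created series reports no increase even though one error happened, while the series that was exported at 0 up front reports it.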
While I sort of understand the Prometheus idea that pull is easier to scale than push, I've had bad luck with it.
First of all, Prometheus doesn't even consider monitoring long-running jobs in any way other than pull (which didn't make sense to me). There is the Pushgateway [0], but the client libraries seem to consider it only for short-lived jobs where you can send the metrics at the end [1]. It seems I couldn't trivially "push" from long-lived jobs.
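For what it's worth, the Pushgateway itself is just an HTTP endpoint that accepts the text exposition format under `/metrics/job/<job>`, so a long-running job could in principle push on a timer without a client library. Below is a minimal stdlib-only sketch of that idea. The helper names, the metric name, the job name, the 15-second interval, and the `http://localhost:9091` address are all assumptions for illustration, and this is not how the project recommends using the Pushgateway.

```python
import time
import urllib.request


def render_gauge(name: str, value: float, help_text: str) -> bytes:
    """Render one gauge in the Prometheus text exposition format."""
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} gauge",
        f"{name} {value}",
        "",  # the body has to end with a newline
    ]
    return "\n".join(lines).encode("utf-8")


def build_push_request(base_url: str, job: str, body: bytes) -> urllib.request.Request:
    """Build a PUT request that replaces the metrics stored for this job."""
    url = f"{base_url.rstrip('/')}/metrics/job/{job}"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


if __name__ == "__main__":
    # Hypothetical long-running job: push progress every 15 seconds.
    rows_done = 0
    while rows_done < 100:
        rows_done += 10  # stand-in for real work
        body = render_gauge("import_rows_done", rows_done, "Rows processed so far")
        req = build_push_request("http://localhost:9091", "nightly_import", body)
        with urllib.request.urlopen(req, timeout=5) as resp:
            resp.read()
        time.sleep(15)
```

One caveat with this approach: the gateway keeps serving the last pushed values until they are deleted, so a job that dies silently still looks alive to Prometheus.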
Second, when using it with, for example, Django, you have to be careful with how you handle the multiprocessing that uWSGI/gunicorn does, see [2]. It has bitten me at least once.
Compare that to the push model, where I can just push metrics to statsd_exporter [3] directly and be done with it. But support for StatsD is lacking both in terms of frameworks (everyone seems to be migrating to native clients...) and functionality (you have to do labeling basically manually [4]).
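To show what "just push and be done with it" looks like, here is a minimal sketch of sending a StatsD counter over UDP with nothing but the standard library. The metric name is made up, and the host and port are assumptions (9125 is, as far as I know, the default UDP port statsd_exporter listens on, so adjust for your setup).

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Format a counter increment in the plain StatsD line protocol."""
    return f"{name}:{value}|c".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 9125) -> None:
    """Fire-and-forget a single StatsD datagram over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Hypothetical metric name; no labels, since plain StatsD has none.
    send_metric(format_counter("app.errors"))
```

There is no registry, no scrape endpoint, and no per-process state to reconcile, which is why the multiprocess problem from [2] doesn't come up here. The flip side is the one mentioned above: the name is a flat string, so any labels have to be mapped on the exporter side.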
To sum up: Prometheus is really great when it works, but once you go off-track (intentionally or not, see Django [2]) you see that it's all an unexplored and immature landscape.
[0] https://github.com/prometheus/pushgateway
[2] https://github.com/korfuri/django-prometheus/blob/master/doc...