-
We've been developing the BlueWave Uptime Manager [1] for the past 5 months with a team of 7 developers and 3 external contributors, and until now we've largely flown under the radar.
As we move towards expanding from basic uptime tracking to a comprehensive monitoring solution, we're interested in getting insights from the community.
For those of you managing server infrastructure:
- What are the key assets you monitor beyond the basics like CPU, RAM, and disk usage?
- Do you also keep tabs on network performance, processes, services, or other metrics?
Additionally, we're debating whether to build a custom monitoring agent or leverage existing solutions like OpenTelemetry or Fluentd.
- What’s your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
- Lastly, what’s your preference for data collection—do you prefer an agent that pulls data or one that pushes it to the monitoring system?
[1] https://github.com/bluewave-labs/bluewave-uptime
-
Related, have there been any 'truly open-source' forks of Grafana since their license change and other shenanigans? Or does anyone know of good Grafana alternatives from FOSS devs in general? My default right now is to just use Prometheus itself, but I miss some of the dashboard functionality etc. from Grafana.
Grafana's license change to AGPLv3, combined with an experience I had reporting security vulnerabilities to them and with their poor stewardship of changes like this [1], left a bad taste in my mouth.
[1] https://github.com/grafana/grafana/pull/6627
-
Or Checkmk [1], which originated from Nagios and brings thousands of plugins for nearly every piece of hardware and service you can think of.
[1] https://checkmk.com/
-
If I were in your position I would craft my own OTel distribution and ship it.
This is very easy to do: https://github.com/open-telemetry/opentelemetry-collector-bu...
With this approach you’re standing on the shoulders of giants, compatible with any agent that speaks OTLP, and can market your distribution as an ecosystem tool.
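For illustration, a minimal builder manifest could look like the sketch below; the distribution name is made up and the module versions are examples only, so pin whatever your builder release expects:

    # builder-config.yaml - minimal OpenTelemetry Collector Builder manifest sketch
    dist:
      name: bluewave-otelcol
      description: Custom OpenTelemetry Collector distribution
      output_path: ./bluewave-otelcol

    receivers:
      # host CPU, memory, disk, network and process metrics
      - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver v0.102.0
      # accept OTLP from any instrumented app or agent
      - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.102.0

    processors:
      - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.102.0

    exporters:
      - gomod: go.opentelemetry.io/collector/exporter/otlphttpexporter v0.102.0

Running ocb --config builder-config.yaml then produces a self-contained binary containing only the components listed in the manifest.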
-
Healthchecks
Open-source cron job and background task monitoring service, written in Python & Django
-
node_exporter ( https://github.com/prometheus/node_exporter ) and process-exporter ( https://github.com/ncabatoff/process-exporter ) expose most of the useful metrics for monitoring server infrastructure together with the running processes. I'd also recommend taking a look at the Coroot agent, which uses eBPF to export the essential host and process metrics - https://github.com/coroot/coroot-node-agent .
As for the agent, from an operations perspective it is better to run a single observability agent per host. This agent should be small in size, lightweight on CPU and RAM, free of external dependencies, and close to zero-config: it should automatically discover all the apps and metrics that need to be monitored and send them to a centralized observability database.
If you don't want to write the agent yourself, take a look at vmagent ( https://docs.victoriametrics.com/vmagent/ ), which scrapes metrics from the exporters mentioned above. vmagent satisfies most of the requirements stated above except for configuration - you need to provide configs for scraping metrics from the separately installed exporters, as in the sketch below.
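For reference, a minimal vmagent setup could look like this; the targets assume the exporters' default ports, and the remote write URL is a made-up placeholder for wherever the observability database lives:

    # scrape.yml - minimal Prometheus-compatible scrape config for vmagent
    scrape_configs:
      - job_name: node
        static_configs:
          - targets: ["localhost:9100"]   # node_exporter default port
      - job_name: process
        static_configs:
          - targets: ["localhost:9256"]   # process-exporter default port

Then run vmagent -promscrape.config=scrape.yml -remoteWrite.url=http://victoria-metrics:8428/api/v1/write to forward everything to the database.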
-
node_exporter all the way: https://github.com/prometheus/node_exporter
-
I try to monitor everything, because it makes it much easier to debug weird issues when sh*t hits the fan.
> Do you also keep tabs on network performance, processes, services, or other metrics?
Everything :)
> What's your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
I went with collectd [1] and Telegraf [2] simply because they support tons of modules and are very stable. However, I have a couple of bespoke agents for the spots where neither collectd nor Telegraf fits.
> Lastly, what's your preference for data collection—do you prefer an agent that pulls data or one that pushes it to the monitoring system?
We can argue this to death, but I'm for push-based agents all the way down. They are much easier to scale, and things are painless to manage when the right tool is used (I'm using Riemann [3] for shaping, routing, and alerting). I used to run a Zabbix setup, and scaling was always the issue (Zabbix is pull-based). I'm still baffled that pull-based monitoring gained traction; probably every generation needs to repeat the mistakes of the past. A minimal push-style config is sketched after the links below.
[1] https://www.collectd.org/
[2] https://www.influxdata.com/time-series-platform/telegraf/
[3] https://riemann.io/
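To make the push model concrete, here is a minimal collectd sketch; the server address is a made-up placeholder. The agent samples locally and ships everything to a central receiver over collectd's network protocol (25826 is its default port):

    # /etc/collectd/collectd.conf - minimal push-style sketch
    LoadPlugin cpu
    LoadPlugin memory
    LoadPlugin interface
    LoadPlugin network

    <Plugin network>
      # push all collected metrics to the central receiver
      Server "metrics.example.com" "25826"
    </Plugin>

Nothing ever connects inbound to the monitored host, which is a big part of why the push model scales and stays simple behind NATs and firewalls.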
-
> Performance-wise e.g. VictoriaMetrics' prometheus-benchmark only covered instant queries without look back for example the last time I checked.
prometheus-benchmark ( https://github.com/VictoriaMetrics/prometheus-benchmark ) tests CPU, RAM, and disk usage for typical alerting queries. It doesn't test the performance of queries used for building graphs in Grafana, because the typical rate of alerting queries is multiple orders of magnitude higher than the rate of graphing queries, i.e. alerting queries generate most of the CPU, RAM, and disk IO load in a typical production workload. The distinction is illustrated below.
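Concretely, the two query shapes differ at the Prometheus-compatible HTTP API level; the host, metric, and timestamps below are placeholder examples:

    # instant query at a single timestamp - the shape alerting rules use
    curl 'http://victoria:8428/api/v1/query?query=up'

    # range query over a time window - the shape Grafana panels use
    curl 'http://victoria:8428/api/v1/query_range?query=up&start=2024-07-01T00:00:00Z&end=2024-07-01T01:00:00Z&step=15s'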
If you think it would be a good addition, please file a feature request at https://github.com/VictoriaMetrics/prometheus-benchmark/issu... for the ability to test resource usage for typical Grafana graphing queries.
> I am also not aware of VictoriaMetrics giving back anything to the Prometheus ecosystem (can you maybe link some examples if I am wrong?)
Sure:
- https://github.com/prometheus/prometheus/issues?q=author%3Av...
- https://github.com/prometheus/prometheus/issues?q=author%3Ah...
> As per recent actual examples, here's a second submission of the same post bashing a project in the ecosystem: https://news.ycombinator.com/item?id=40838531
That submission links to the real-world experience of a long-term Grafana Loki user, who points out various issues in the applications he uses. For example:
- Issues with Loki restarts - https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLok...
- Issues with structured metadata in Loki 3.0 - https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLok...
- Issues with single-node Loki setup - https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLok...
- Issues with Loki logcli command - https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLok...
- Issues with Grafana Loki data compaction - https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLok...
- Comparison of Grafana Loki vs traditional syslog server - https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLok...
As you can see, this user shares his extensive experience with Grafana Loki and continues using it despite the fact that a much better solution exists that is free from all these Loki issues - VictoriaLogs. He isn't affiliated with VictoriaMetrics in any way.
-