If you are OK with SaaS and you are only monitoring scheduled jobs, there are a number of monitoring tools where the job tells the service when it completes (with an HTTP request), and a missing ping (after a grace period) means that it failed.
I think https://deadmanssnitch.com/ may have been the original service for this.
https://healthchecks.io/ has a fairly generous free tier that I use now.
There are others that do the same thing: Sentry, Uptime Robot, ...
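To make the "missing ping after a grace period" rule concrete, here is a minimal sketch of how a monitor could classify a job. It is not the code of any of the services named above; the function name `job_status`, the three state names, and the example schedule are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def job_status(last_ping, period, grace, now):
    """Classify a job from the time of its last successful ping.

    "up"   : the next ping is not due yet
    "late" : the ping is overdue but still inside the grace period
    "down" : the grace period has passed with no ping
    """
    if last_ping is None:
        # Assumption: a job that has never pinged is treated as failed.
        return "down"
    overdue = now - (last_ping + period)
    if overdue <= timedelta(0):
        return "up"
    if overdue <= grace:
        return "late"
    return "down"


# A nightly job with a one-hour grace period (illustrative values).
last = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
period = timedelta(days=1)
grace = timedelta(hours=1)

print(job_status(last, period, grace, last + timedelta(hours=12)))
print(job_status(last, period, grace, last + timedelta(hours=24, minutes=30)))
print(job_status(last, period, grace, last + timedelta(hours=26)))
```

The grace period is what keeps a job that simply runs a little long from paging you; only the last case would raise an alert.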
Uptime-Kuma [1] with ntfy [2]. Most of my services expose HTTP so I just have Uptime-Kuma monitor that. But if you have something that is not exposed to the public you can still use a "push" type monitor, and in a cron job on your server(s), send a heartbeat to it when everything is working.
[1] https://github.com/louislam/uptime-kuma
[2] https://ntfy.sh/
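A rough sketch of the cron-side half of a "push" monitor, assuming the monitor gives you a URL to request for each heartbeat. The names `send_heartbeat` and `http_get` are hypothetical, and the push URL must be copied from your own monitor's settings; none is hard-coded here.

```python
import urllib.request


def http_get(url, timeout=10):
    """Plain GET request; returns the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def send_heartbeat(check, push_url, send=http_get):
    """Run `check` and push a heartbeat only if it returns a truthy value.

    Returns True if a heartbeat was sent. When the check fails or raises,
    nothing is sent, so the monitor's missing-heartbeat alert does the
    notifying.
    """
    try:
        healthy = bool(check())
    except Exception:
        healthy = False
    if not healthy:
        return False
    send(push_url)
    return True


# From a script run by cron, something like (hypothetical check and URL):
#   send_heartbeat(lambda: os.path.exists("/var/run/myapp.pid"), PUSH_URL)
```

Staying silent on failure, rather than reporting it, means the alert still fires when the server itself is down and the cron job never runs.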
In general this evolves to a SIEM-like solution in IT or gets added to the tag menagerie in OT.
If you're focused on "notifications are bad", note that notifications are push, and pull solutions are possible. Tail logs (or journalctl) and post significant events to Redis (https://github.com/m3047/rkvdns_examples/tree/main/totalizer...), for example.
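As a sketch of the pull idea (not the linked repository's code), the function below reads journal entries in the JSON form that `journalctl -o json` emits and bumps a per-unit counter for anything at priority "err" or worse. The key layout `journal:errors:<unit>` and the names `tally_significant` and `MemoryStore` are assumptions; the store only needs an `incr(key)` method, which redis-py clients also provide.

```python
import json
from collections import Counter


class MemoryStore:
    """In-memory stand-in for a Redis client, offering only incr()."""

    def __init__(self):
        self.counts = Counter()

    def incr(self, key):
        self.counts[key] += 1
        return self.counts[key]


def tally_significant(lines, store, max_priority=3):
    """Count journal entries at or above a severity, per systemd unit.

    Syslog priorities run from 0 (emerg) through 3 (err) to 7 (debug),
    so a lower number means a more severe entry.
    """
    for line in lines:
        try:
            entry = json.loads(line)
            if not isinstance(entry, dict):
                continue
            priority = int(entry.get("PRIORITY", 7))
        except (ValueError, TypeError):
            continue  # skip malformed lines and non-numeric priorities
        if priority <= max_priority:
            unit = entry.get("_SYSTEMD_UNIT", "unknown")
            store.incr(f"journal:errors:{unit}")


sample = [
    '{"PRIORITY": "3", "_SYSTEMD_UNIT": "backup.service", "MESSAGE": "failed"}',
    '{"PRIORITY": "6", "_SYSTEMD_UNIT": "backup.service", "MESSAGE": "started"}',
    "not json",
]
store = MemoryStore()
tally_significant(sample, store)
print(dict(store.counts))
```

In a live setup the lines would come from the stdout of `journalctl -f -o json`, and whoever wants to know reads the counters when they choose to, which is the pull model described above.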
This combo does the job for me: grafana + riemann + influxdb and collectd as the main agent. collectd bundles many plugins so you can watch logs, monitor running processes or have something custom [1]. This setup is very light to start with and can scale well (up until you hit influxdb limits :D).
[1] https://github.com/mbachry/collectd-systemd
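For the "something custom" case, one lightweight route is collectd's Exec plugin, which reads `PUTVAL` lines from a script's standard output. This is a different mechanism from the linked plugin, and the sketch below only formats such a line; the function name `putval_line` and the metric name `queue_depth` are made up for illustration.

```python
def putval_line(host, plugin, type_name, value, interval=10,
                plugin_instance=None, type_instance=None):
    """Format one line of collectd's plain-text protocol.

    The identifier has the shape host/plugin[-instance]/type[-instance];
    "N" as the timestamp asks collectd to use the current time.
    """
    plugin_part = f"{plugin}-{plugin_instance}" if plugin_instance else plugin
    type_part = f"{type_name}-{type_instance}" if type_instance else type_name
    identifier = f"{host}/{plugin_part}/{type_part}"
    return f'PUTVAL "{identifier}" interval={interval} N:{value}'


# Hypothetical metric: report a queue depth of 42 as a gauge.
print(putval_line("web1", "exec", "gauge", 42, type_instance="queue_depth"))
```

A real Exec script would print a line like this in a loop, flushing stdout each time, and would normally take the host name and interval from the `COLLECTD_HOSTNAME` and `COLLECTD_INTERVAL` environment variables that collectd sets for it.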
I use the `OnFailure` property to trigger a service that emails me when a service fails, for example backups, which run as systemd timer + service pairs.
I also use `failure-monitor`, which is a Python service that monitors `journald`.
Files on GitHub for those interested:
https://github.com/kylemanna/systemd-utils
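A minimal sketch of what the emailing side of an `OnFailure` handler might look like. This is not the code from the repository above; the function name `build_failure_email` and the addresses are placeholders, and it assumes the handler is told which unit failed (for example through a template unit's instance name).

```python
from email.message import EmailMessage


def build_failure_email(unit, details, sender, recipient):
    """Build a notification email for a failed systemd unit.

    `details` is whatever context the handler collected, such as the
    output of `systemctl status <unit>`.
    """
    msg = EmailMessage()
    msg["Subject"] = f"[systemd] {unit} failed"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(details or "(no status output captured)")
    return msg


# Placeholder values; a real handler would receive the unit name as an argument.
msg = build_failure_email(
    "backup.service",
    "Main process exited, code=exited, status=1/FAILURE",
    "root@example.org",
    "admin@example.org",
)
print(msg["Subject"])
```

Delivery could then go through a local mail server with `smtplib.SMTP("localhost").send_message(msg)`, assuming one is listening on the host.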
> So I turned to Netdata. A one liner on each server and we had super sexy and fast dashboard for each server. No birds eye view, but fine. I then spent maybe 3-4 days trying to figure out how to get alerting to work (just email, but fine) and get temperature readings (or something like that).
I work at Netdata. Just wanted to mention that as of the last release, a parent node shows all of its children in the agent dashboard, so if you were doing this again today, a parent Netdata might have given you the bird's-eye view as a starting point: https://github.com/netdata/netdata/releases/tag/v1.41.0