The open source option from Netflix is quite popular too: https://github.com/Netflix/dispatch
Cool, thanks for this view.
I'm also intrigued by the text in this launch announcement:
> Our focus in the early days was build a hyper opinionated product to help them follow what we believe are the best practices. Now our product direction is focused on configuration and flexibility, how can we plug Rootly into your already existing way of working and automate it. This has helped our larger enterprise customers be successful with their current processes being automated.
As I have gotten more experience managing complex incidents I've come around to the idea that having a standard process you follow for big issues is somewhat more important than what the process really is.
I loved the PagerDuty response documentation (https://response.pagerduty.com/), not so much because of the specifics but because it suggests they have a culture with a well-understood protocol they always try to follow for big problems.
I think about archery and "shot grouping" - once you learn to always land in the same place, you can move your aim to start landing somewhere else.
A number of the things that I see as valuable in incident management involve having responders with a shared set of priorities. Tooling can influence how easy or hard some of these things are, but it's really up to the people to do things like:
* Actually finding and fixing the problem and being sure the fix worked
* Clearly communicating the current user impact to the people who care
* Figuring out who the right responders are, and getting them in the room quickly
* Making one production change at a time with the incident coordinator's signoff, so you know which one helped and when it happened
* Helping the rest of the organization learn from what happened (you may not know what there is to learn)
Do you see room for the tooling company to also provide best-practices training, mentorship, or other kinds of support? That stuff scales less well than a web app but is arguably more important to changing a company's culture in a way that gets better user outcomes.