incident-response-docs

PagerDuty's Incident Response Documentation. (by PagerDuty)

incident-response-docs reviews and mentions

Posts with mentions or reviews of incident-response-docs. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-12-22.
  • It's not always DNS – unless it is
    2 projects | news.ycombinator.com | 22 Dec 2023
    I can’t read the blog but PagerDuty provides a good standard for handling incidents: https://response.pagerduty.com/
  • What's your incident response flow?
    2 projects | /r/sre | 27 May 2023
    If you’re after some general advice, PagerDuty’s response guide is evergreen content, as is our practical guide to incident management.
  • SRE - Process to handle incident management
    1 project | /r/devops | 7 Sep 2022
    PagerDuty has shared their process and has some great resources: https://response.pagerduty.com/
  • What happens if you cannot resolve the issue at hand?
    1 project | /r/sysadmin | 25 Aug 2022
  • Launch HN: Rootly (YC S21) – Manage Incidents in Slack
    2 projects | news.ycombinator.com | 7 Jun 2022
    Cool, thanks for this view.

    I'm also intrigued by the text in this launch announcement:

    > Our focus in the early days was to build a hyper-opinionated product to help them follow what we believe are the best practices. Now our product direction is focused on configuration and flexibility, how can we plug Rootly into your already existing way of working and automate it. This has helped our larger enterprise customers be successful with their current processes being automated.

    As I have gotten more experience managing complex incidents I've come around to the idea that having a standard process you follow for big issues is somewhat more important than what the process really is.

    I loved the PagerDuty response documentation ( https://response.pagerduty.com/ ) not so much because of the specifics but because it suggests they have a culture where there is a well-understood protocol they always try to follow for big problems.

    I think about archery and "shot grouping" - once you learn to always land in the same place, you can move your aim to start landing somewhere else.

    A number of the things that I see as valuable in incident management involve having responders with a shared set of priorities. Tooling can influence how easy or hard some of these things are, but it's really up to the people to do things like:

    * Actually finding and fixing the problem and being sure the fix worked

    * Clearly communicating the current user impact to the people who care

    * Figuring out who the right responders are, and getting them in the room quickly

    * Making one production change at a time with the incident coordinator's signoff, so you know which one helped and when it happened

    * Helping the rest of the organization learn from what happened (you may not know what there is to learn)

    Do you see room for the tooling company to also provide best-practices training, mentorship, or other kinds of support? That stuff scales less well than a web app but is arguably more important to changing a company's culture in a way that gets better user outcomes.

  • Startup guide to incident management
    1 project | dev.to | 16 Mar 2022
    There's an enormous amount of content available for organisations looking to import 'gold standard' incident management best practices -- things like the PagerDuty Response site, the Atlassian incident management best practices, and the Google SRE book. All of these are fantastic resources for larger companies, but as a newly founded startup, you're left to figure out which bits are important and which bits you can defer until later on.
  • Diary of a First-Time On-Call Engineer
    1 project | news.ycombinator.com | 14 Mar 2022
    Career-long sysadmin/SRE/SRE Team Lead, here. I've worked at large shops (10-30 million end users) and some shops where 99.98% is the SLA to prevent millions of dollars of losses in supply chain.

    First of all, I appreciated this diary because Anna took the task with a positive attitude and as a learning experience. Thanks for writing this. To see an old problem through new eyes is inspiring.

    I have numerous "hot-take" criticisms of your current organization's practices, but I'm not sure I have all the context yet. The one suggestion I will make is: if you're not already using it, clone https://github.com/pagerduty/incident-response-docs/ and modify it to meet your organization's needs. Then, have it blessed as policy by management and train SREs and Devs on it.

    To the other comments: I see there's a lot of people here who say they'd never do the SRE job, or return to doing it. I'm not discounting your fear or feelings of burnout. Been there. But, hear me out:

    DevOps is not just about CI/CD pipelines and monitoring and PagerDuty. It's about having a culture where developers don't throw operational or security poop over a wall of confusion at sysadmin types as well as at their peers. This kind of organizational dysfunction can be devastating to a business.

    DevOps at its best is about empathy. One of the best places I ever worked was filled with developers who had true empathy. They realized that an error or omission in their work could wake up their Ops team at stupid o'clock in the morning, repeatedly, leading to all the things that drive SREs and on-call folks literally insane. They practised strict TDD.

    These developers volunteered to be second-tier on call after the ops team did triage, out of the kindness of their hearts for their coworkers. Management also led a culture of defending time to find permanent solutions to drive measured improvements in SLI.

    SRE isn't about waking up at stupid o'clock every night to press buttons. It's about having a culture of driving permanent fixes and compensating by using cost-effective and appropriate cloud architectures. It's also about leading the working agreements with engineering teams to do blameless post-incident retros together and making the work bring your teams closer instead of pushing them apart.

    I can't help but take away that a lot of you feel like On-Call heroics are what SRE is about. It's more difficult than that, but also less stressful, and simultaneously more rewarding when you get it right.

Stats

Basic incident-response-docs repo stats
Mentions: 8
Stars: 1,009
Activity: 3.0
Last commit: 8 months ago

PagerDuty/incident-response-docs is an open source project licensed under the Apache License 2.0, which is an OSI-approved license.

The primary programming language of incident-response-docs is Dockerfile.
