docs vs incident-response-docs

| | docs | incident-response-docs |
| --- | --- | --- |
| Mentions | 3 | 8 |
| Stars | 598 | 1,009 |
| Growth | 0.2% | 0.8% |
| Activity | 9.0 | 3.0 |
| Latest commit | 10 days ago | 9 months ago |
| Language | Dockerfile | Dockerfile |
| License | MIT License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
docs
- Stop manually iterating over your test data in your unit tests! - Parametrized tests in C# using MSTest, NUnit, and XUnit
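The post above is about replacing hand-written loops over test data with parametrized tests. As a minimal sketch of the NUnit flavor of this technique (the `Calculator` class here is a hypothetical example, not taken from the post), `[TestCase]` runs a single test method once per data row:

```csharp
using NUnit.Framework;

// Hypothetical class under test, used only for illustration.
public class Calculator
{
    public int Add(int a, int b) => a + b;
}

[TestFixture]
public class CalculatorTests
{
    // NUnit runs this method once per [TestCase] row,
    // instead of a hand-written loop over the test data.
    [TestCase(1, 2, 3)]
    [TestCase(-1, 1, 0)]
    [TestCase(0, 0, 0)]
    public void Add_ReturnsSum(int a, int b, int expected)
    {
        Assert.That(new Calculator().Add(a, b), Is.EqualTo(expected));
    }
}
```

Each failing row is reported individually in the test runner, which is the main advantage over iterating inside one test body.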
- Unlock the Power of Unit Testing: A Beginner’s Guide to Quality Software Development
This is a basic example of how to create an NUnit unit test for a simple API in a controller with C#. You can find more information and resources on the NUnit website and in the NUnit documentation.
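The shape of such a controller test might look like the following sketch; the `GreetingController` and its behavior are hypothetical stand-ins for the "simple API" the post describes, not code from the post itself:

```csharp
using NUnit.Framework;
using Microsoft.AspNetCore.Mvc;

// Hypothetical controller, standing in for the post's "simple API".
public class GreetingController : ControllerBase
{
    public IActionResult Get(string name) =>
        string.IsNullOrWhiteSpace(name)
            ? (IActionResult)BadRequest("name is required")
            : Ok($"Hello, {name}");
}

[TestFixture]
public class GreetingControllerTests
{
    [Test]
    public void Get_WithName_ReturnsOkWithGreeting()
    {
        // Controllers are plain classes, so they can be
        // constructed and called directly in a unit test.
        var result = new GreetingController().Get("Anna") as OkObjectResult;
        Assert.That(result, Is.Not.Null);
        Assert.That(result!.Value, Is.EqualTo("Hello, Anna"));
    }

    [Test]
    public void Get_WithEmptyName_ReturnsBadRequest()
    {
        var result = new GreetingController().Get("");
        Assert.That(result, Is.InstanceOf<BadRequestObjectResult>());
    }
}
```

Asserting on the concrete `IActionResult` type (`OkObjectResult`, `BadRequestObjectResult`) is the usual way to check both the status and the payload of a controller action.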
- How to Use NUnit Annotations For Selenium Automation Testing [With Example]
This blog used NUnit 2; further blogs will cover NUnit 3. We will update the blog content to mention explicitly that the annotations described are those in NUnit 2. We are aware of the changes in NUnit 3 (particularly https://github.com/nunit/docs/wiki/SetUp-and-TearDown-Changes) and we definitely plan a similar blog on annotations in NUnit 3.
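The SetUp-and-TearDown changes linked above mostly concern renamed attributes: NUnit 2's `[TestFixtureSetUp]`/`[TestFixtureTearDown]` became `[OneTimeSetUp]`/`[OneTimeTearDown]` in NUnit 3, while the per-test `[SetUp]`/`[TearDown]` names stayed the same. A short sketch of the NUnit 3 lifecycle attributes (the fixture itself is a made-up example):

```csharp
using NUnit.Framework;

[TestFixture]
public class LifecycleExamples
{
    // NUnit 2 called these [TestFixtureSetUp] / [TestFixtureTearDown];
    // NUnit 3 renamed them to [OneTimeSetUp] / [OneTimeTearDown].
    [OneTimeSetUp]
    public void BeforeAllTests() { /* runs once, before the first test in this fixture */ }

    [OneTimeTearDown]
    public void AfterAllTests() { /* runs once, after the last test in this fixture */ }

    // [SetUp] and [TearDown] keep their names in NUnit 3
    // and still run before/after every individual test.
    [SetUp]
    public void BeforeEachTest() { }

    [Test]
    public void ExampleTest() => Assert.That(1 + 1, Is.EqualTo(2));
}
```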
incident-response-docs
- It's not always DNS – unless it is
I can’t read the blog, but PagerDuty provides a good standard for handling incidents: https://response.pagerduty.com/
- What's your incident response flow?
If you’re after some general advice, PagerDuty’s response guide is evergreen content, as is our practical guide to incident management.
- SRE - Process to handle incident management
PagerDuty has shared their process and has some great resources: https://response.pagerduty.com/
- What happens if you cannot resolve the issue at hand?
- Launch HN: Rootly (YC S21) – Manage Incidents in Slack
Cool, thanks for this view.
I'm also intrigued by the text in this launch announcement:
> Our focus in the early days was to build a hyper-opinionated product to help them follow what we believe are the best practices. Now our product direction is focused on configuration and flexibility: how can we plug Rootly into your already existing way of working and automate it? This has helped our larger enterprise customers be successful with their current processes being automated.
As I have gotten more experience managing complex incidents I've come around to the idea that having a standard process you follow for big issues is somewhat more important than what the process really is.
I loved the PagerDuty response documentation ( https://response.pagerduty.com/ ) not so much because of the specifics but because it suggests they have a culture where there is a well-understood protocol they always try to follow for big problems.
I think about archery and "shot grouping" - once you learn to always land in the same place, you can move your aim to start landing somewhere else.
A number of the things I see as valuable in incident management involve having responders with a shared set of priorities. Tooling can influence how easy or hard some of these things are, but it's really up to the people to do things like:
* Actually finding and fixing the problem and being sure the fix worked
* Clearly communicating the current user impact to the people who care
* Figuring out who the right responders are, and getting them in the room quickly
* Making one production change at a time with the incident coordinator's signoff, so you know which one helped and when it happened
* Helping the rest of the organization learn from what happened (you may not know what there is to learn)
Do you see room for the tooling company to also provide best-practices training, mentorship, or other kinds of support? That stuff scales less well than a web app but is arguably more important to changing a company's culture in a way that gets better user outcomes.
- Startup guide to incident management
There's an enormous amount of content available for organisations looking to import 'gold standard' incident management best practices -- things like the PagerDuty Response site, the Atlassian incident management best practices, and the Google SRE book. All of these are fantastic resources for larger companies, but as a newly founded startup, you're left to figure out which bits are important and which bits you can defer until later on.
- Diary of a First-Time On-Call Engineer
Career-long sysadmin/SRE/SRE Team Lead, here. I've worked at large shops (10-30 million end users) and some shops where 99.98% is the SLA to prevent millions of dollars of losses in supply chain.
First of all, I appreciated this diary because Anna took the task with a positive attitude and as a learning experience. Thanks for writing this. To see an old problem through new eyes is inspiring.
I have numerous, "hot-take" criticisms of your current organization's practices, but I'm not sure I have all the context yet. The one suggestion I will make is: if you're not already using it - clone https://github.com/pagerduty/incident-response-docs/ and modify it to meet your organization's needs. Then, have it blessed as policy by management and train SREs and Devs on it.
To the other comments: I see there's a lot of people here who say they'd never do the SRE job, or return to doing it. I'm not discounting your fear or feelings of burnout. Been there. But, hear me out:
DevOps is not just about CI/CD pipelines and monitoring and PagerDuty. It's about having a culture where developers don't throw operational or security poop over a wall of confusion at sysadmin types, or at their peers. This kind of organizational dysfunction can be devastating to a business.
DevOps at its best is about empathy. One of the best places I ever worked was filled with developers who had true empathy. They realized that an error or omission in their work could wake up their Ops team at stupid o'clock in the morning, repeatedly - leading to all the things that drive SREs and on-call folks literally insane. They practised strict TDD.
These developers volunteered to be second-tier on-call after the ops team did triage, out of the kindness of their hearts for their coworkers. Management also led a culture of defending time to find permanent solutions and to drive measured improvements in SLIs.
SRE isn't about waking up at stupid o'clock every night to press buttons. It's about having a culture of driving permanent fixes and compensating by using cost-effective and appropriate cloud architectures. It's also about leading the working agreements with engineering teams to do blameless post-incident retros together and making the work bring your teams closer instead of pushing them apart.
I can't help but take away that a lot of you feel like On-Call heroics are what SRE is about. It's more difficult than that, but also less stressful, and simultaneously more rewarding when you get it right.
What are some alternatives?
ohmyform - ✏️ Free open source alternative to TypeForm, TellForm, or Google Forms ⛺
pagerduty2zabbix - Update Zabbix events with PagerDuty incident changes via WebHook (2-way ack).
blog-data-driven-testing
postmortem-docs - PagerDuty's Public Postmortem Documentation
NUnit - NUnit Framework
security-training - Public version of PagerDuty's employee security training courses.
docker-sonarr