You Broke Reddit: The Pi-Day Outage

This page summarizes the projects mentioned and recommended in the original post on /r/RedditEng

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • enhancements

    Enhancements tracking repo for Kubernetes

  • The nodeSelector and peerSelector for the route reflectors target the label `node-role.kubernetes.io/master`. In the 1.20 series, Kubernetes changed its terminology from “master” to “control-plane.” And in 1.24, they removed references to “master,” even from running clusters. This is the cause of our outage. Kubernetes node labels.

  • terraform

    Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.

  • The restore for the first node was completed successfully, and we were back in business. Within moments, nodes began coming online as the cluster autoscaler sprung back to life. This was a great sign because it indicated that networking was working again. However, we weren’t ready for that quite yet and shut off the autoscaler to buy ourselves time to get things back to a known state. This is a large cluster, so with only a single control plane node, it would very likely fail under load. So, we wanted to get the other two back online before really starting to scale back up. We brought up the next two and ran into our next sticking point: AWS capacity was exhausted for our control plane instance type. This further delayed our response, as canceling a ‘terraform apply` can have strange knock-on effects with state and we didn’t want to run the risk of making things even worse. Eventually, the nodes launched, and we began trying to join them.

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • OPA (Open Policy Agent)

    Open Policy Agent (OPA) is an open source, general-purpose policy engine.

  • At this point, someone spotted that we were getting a lot of timeouts in the API server logs for write operations. But not specifically on the writes themselves. Rather, it was timeouts calling the admission controllers on the cluster. Reddit utilizes several different admission controller webhooks. On this cluster in particular, the only admission controller we use that’s generalized to watch all resources is Open Policy Agent (OPA). Since it was down anyway, we took this opportunity to delete its webhook configurations. The timeouts disappeared instantly… But the cluster didn’t recover.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts