Chaos Engineering

Approaching Chaos

How to think about starting chaos engineering

Attempting chaos engineering in an organization for the first time can be a challenge. Let’s assume all the right people are bought into the idea; what next? There are two wildly different approaches you can take. The first is to automate chaos experiments, using tools like Chaos Monkey to inject faults1, or an entire platform dedicated to the practice like Gremlin2. Alternatively, experiments can be performed manually. Both have their appeal, and manual experiments can add far more value in the right scenarios. Chaos engineering leads engineers to think about technical systems and their resilience in terms of observable, quantifiable metrics, but the incident response and socio-technical sides of the practice are often left out of fully automated systems.
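To make the automated end of that spectrum concrete, here is a minimal sketch of in-process fault injection. The decorator and the fetch_profile function are hypothetical, and real tools like Chaos Monkey or Gremlin operate at the infrastructure level (terminating instances, degrading networks) rather than inside application code.

```python
import random
import time
from functools import wraps


def inject_faults(error_rate=0.1, max_latency_s=2.0):
    """Wrap a callable so a fraction of calls fail or slow down.

    Illustrative sketch only; not how Chaos Monkey or Gremlin are
    implemented, which inject faults at the infrastructure level.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < error_rate:
                raise RuntimeError(f"chaos: injected failure in {fn.__name__}")
            # Otherwise add jitter to simulate a degraded dependency.
            time.sleep(random.uniform(0, max_latency_s))
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(error_rate=0.05, max_latency_s=0.5)
def fetch_profile(user_id):
    # Hypothetical downstream call, used only for illustration.
    return {"user_id": user_id, "name": "example"}
```

Even a toy like this forces the question the automated approach is meant to force: how do callers behave when the dependency fails or slows down at any time?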

The value of full automation is baking reliability into products and services; it becomes mandatory when anything can fail at any time. This is enforced by running randomized sets of experiments that inject faults into the system. An interesting finding might page the active on-call, who gets some exposure to that failure mode, but the responsibility for dealing with the incident likely falls on a single individual. The event is perhaps discussed during a post-mortem in a way that is actionable, but maybe not. This is one factor in making reliability concerns a cultural norm; the other is making sure the engineers writing the software have incentives to fix issues. One approach is sharing pagers with an incident team.

The core strength of manual runs is knowledge sharing. Putting people with varying degrees of expertise in the same room during the different stages of an experiment makes engineers stronger and more knowledgeable about the actual behavior of their systems. It lets people update their mental models, tune their telemetry, and update incident response plans in preparation for the experiment and its follow-up. Gathering the information needed to answer questions gives people a chance to discover more about their system and how to work with it, which is an important training activity. Because the activity is done as part of an experiment, it becomes real work rather than an exercise, and it has a better chance of growing an engineer: there is no answer book, so they have to figure things out, which is a skill worth building. The follow-up after experiments also teases out improvements to both the technical and socio-technical systems being analyzed. How did someone respond to an incident? Who was paged? What is the narrative of our telemetry? All of these things get actively discussed in manual runs and are often left out of automated ones. They give you insight into the abilities of your engineers and the culture that surrounds you. They also give you a chance to organically build incident response skills in people who may never have responded to an incident before. In real incidents, cortisol levels run high, sleep deprivation kicks in, and people perform worse. Manual runs help people learn to operate with slightly elevated cortisol levels and to internalize the triage and response process for that particular scenario, critically, when the stakes are lower.

Both approaches have their appeal, but a lot of emphasis is placed on the former when the latter is a better place to start in most cases and, arguably, should never be dropped entirely. The ideal takes both into account: instill resilience as a value and bake it into the culture, while still learning and sharing knowledge by running manual experiments. Automating some portions of the manual process is worth it because facilitating experiments is high effort, and reducing the manual burden of experimental setup is a win all around. Keeping the conversations around never-before-run experiments in the design, run, and analysis phases is important for learning. Engineers get to update their mental models of the systems in question, and many ideas about how to improve both social and technical systems come out of these discussions. Automating previously run experiments can be a good approach: it makes sure critical issues get addressed while validating that the hypothesis remains true over time, low-stakes modifications can happen with minimal loss in knowledge transfer in some cases, and it instills the values of a resilience culture by forcing its reality through incident response. New experiments, however, should be run manually because that is where the best learning and sharing happens. This advice comes from my lived experience running experiments, but if you’re interested in learning more, the book3 on the topic elaborates in far more detail. It informed my approach and will likely remain true for a long while. As a final word of caution, do not start running chaos experiments for teams that are already at capacity fixing critical issues. Throwing more work on their plate is a way to ensure that even less gets done. Making sure the proper incentives are in place for this type of work is the right way to move forward.
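As a rough illustration of what automating a previously run experiment can mean, the sketch below encodes a hypothesis and its steady-state check so the hypothesis can be re-validated on a schedule. The Experiment structure and run function are invented for this example; they are not taken from the book or from any specific tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """A previously run manual experiment, captured so its hypothesis can be
    re-validated automatically. Field names are hypothetical and not drawn
    from any particular chaos tooling."""
    name: str
    hypothesis: str
    steady_state: Callable[[], bool]   # e.g. query an SLO and check it holds
    inject_fault: Callable[[], None]   # e.g. terminate one replica
    rollback: Callable[[], None]       # undo the fault; always runs


def run(experiment: Experiment) -> bool:
    """Return True if the hypothesis still holds: steady state is observed
    before the fault and again while it is injected."""
    if not experiment.steady_state():
        print(f"{experiment.name}: aborting, system is not in steady state")
        return False
    try:
        experiment.inject_fault()
        holds = experiment.steady_state()
    finally:
        experiment.rollback()
    print(f"{experiment.name}: hypothesis {'held' if holds else 'was violated'}")
    return holds
```

The design phase conversations still happen for the first manual run; what gets automated is only the repeated validation that the conclusion from that run stays true.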

What are those incentives? First, the engineers responsible need to feel the pain of their incidents; otherwise reliability takes hold in some parts of the organization but not others, making similar incidents likely to recur. Second, the work of improving services needs to be recognized. It is hard to empower people to make meaningful improvements if those improvements go unnoticed. Improve the visibility of a project by creating a narrative about why it matters. People respond to stories, so give them one! Another approach is quantifying risk. Sharing something like:

Removing this failure mode from service x will reduce our chance of catastrophic failure by 47.2% thus saving $9,400,000 over the course of 3 years.

This type of analysis gives you leverage and makes the case for the particular projects analyzed; when completed, they have a dollar amount attached stating how much money the business saved. Compared to product engineering, where the business case is obvious, it becomes clear why certain types of engineers get promoted more often than others. It’s hard to justify this work to people who are skeptical about it to begin with, and this is a way to remove that skepticism, or at least have an answer to it. Measuring impact also enables the formal prioritization of these kinds of projects. If you don’t need this to make a case, awesome! If you struggle to do this type of work in your organization, try it! Maybe you learn that the things you care about are not as big a deal as you thought, or that they are much worse than anyone could have known; either way, it gets you in touch with what the business values, which is a win all around.
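The arithmetic behind a statement like the one quoted above can be as simple as expected-loss math. The sketch below shows one way to produce such numbers; every probability and cost figure in it is invented purely for illustration.

```python
def expected_loss(annual_failure_probability, cost_per_failure, years):
    """Expected cost of one failure mode over a time horizon.

    Every input is an assumption; in practice the numbers come from incident
    history, SLO data, and finance, and they carry wide error bars.
    """
    return annual_failure_probability * cost_per_failure * years


# Hypothetical figures for "service x", chosen only to show the shape of the
# calculation; they are not drawn from any real analysis.
before = expected_loss(annual_failure_probability=0.30,
                       cost_per_failure=2_000_000, years=3)
after = expected_loss(annual_failure_probability=0.30 * (1 - 0.472),
                      cost_per_failure=2_000_000, years=3)

print(f"expected loss before the fix: ${before:,.0f}")
print(f"expected loss after the fix:  ${after:,.0f}")
print(f"estimated savings:            ${before - after:,.0f}")
```

The point is less the precision of the output than that a project now has a number attached to it that the business can weigh against other work.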


  1. Chaos Monkey, a tool developed by Netflix that injects faults into cloud services
  2. Gremlin, an entire platform dedicated to incident analysis and the automation of chaos experiments
  3. Nora Jones and Casey Rosenthal’s Chaos Engineering book