Chaos engineering 101 & Best practices for chaos testing

Read Symflower's introduction to chaos engineering and our best practices for chaos testing

In this blog post, we dive into the topic of chaos engineering for testing the resilience of software products.

Our series about the software testing jungle introduced the basic categories of testing. Within each of these major categories, there are several ways to carry out testing, leading to a wide variety of testing types.

In a previous post, we covered mutation testing, a less commonly used form of error-based testing that aims to evaluate the thoroughness of your test suite. This article covers a resilience testing technique called chaos testing.

What is chaos testing, and how is it different to regular software testing?

The goal of chaos testing (or chaos engineering) is to introduce unexpected or random failures in a controlled environment to test how dependable the system under test is. Used by DevOps and IT teams, chaos testing aims to assess a system’s resilience to minimize outages and increase the system’s uptime/availability.

Unlike traditional testing, chaos testing also covers third-party factors, that is, disruptions and behaviors caused by elements that are not part of the system. Another difference to regular testing is that chaos testing is, in theory, done on your production system (although it’s worth noting that a 2021 survey found that only 34% of companies actually did chaos testing on their production servers).

An excerpt from Gremlin's 2021 State of Chaos Engineering report — Image source: https://www.gremlin.com/state-of-chaos-engineering/2021/

Chaos testing simulates a range of failure conditions such as network disruptions, unexpected peaks of traffic, malfunctioning servers, or unanticipated user actions. When doing chaos testing, you will inject such random failures and faults into your system. Then, you will observe the system’s behavior and responses to help prevent failures and, ultimately, to reduce downtime. The goal is to enable the system to tolerate and recover from failures and disruptions to make sure it reliably maintains the intended functionality.
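To make the idea of injecting random failures more concrete, here is a minimal sketch in Python. It is not taken from any particular chaos engineering tool: the names with_chaos, InjectedFault, fetch_profile, and fetch_profile_with_fallback are all hypothetical and only serve as illustration. The wrapper makes a function fail or slow down at random, so you can observe whether the calling code degrades gracefully.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def with_chaos(func, failure_rate=0.1, max_delay_s=0.0, rng=None):
    """Wrap func so that calls randomly fail or slow down.

    failure_rate: probability (0..1) that a call raises InjectedFault.
    max_delay_s:  upper bound for a random artificial delay per call.
    rng:          random.Random instance; pass a seeded one for repeatable runs.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if max_delay_s > 0:
            time.sleep(rng.uniform(0, max_delay_s))  # simulated latency
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id):
    # Hypothetical stand-in for a call to a remote service.
    return {"id": user_id, "name": "example"}


def fetch_profile_with_fallback(fetch, user_id):
    """The behavior under test: degrade gracefully instead of crashing."""
    try:
        return fetch(user_id)
    except InjectedFault:
        return {"id": user_id, "name": None, "degraded": True}


if __name__ == "__main__":
    flaky_fetch = with_chaos(fetch_profile, failure_rate=0.3, rng=random.Random(42))
    results = [fetch_profile_with_fallback(flaky_fetch, i) for i in range(100)]
    degraded = sum(1 for r in results if r.get("degraded"))
    print(f"{degraded} of {len(results)} calls hit an injected fault and fell back")
```

Real chaos tooling usually injects faults at the infrastructure level (terminating instances, dropping network traffic) rather than inside application code, but the principle of observing the system's response to a deliberately introduced fault is the same.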

The history of chaos testing

Chaos testing is often said to have started with Netflix, long considered a pioneer in chaos engineering. But actually, the roots of this discipline go as far back as the 1980s when an Apple programmer developed Monkey, an application that created random UI events. The intention was to simulate a monkey banging at the keyboard and randomly clicking around with the mouse.

Some more modern-day applications of chaos engineering include Amazon’s GameDay and Google’s similar solution called DiRT (Disaster Recovery Testing). But with the creation of their chaos engineering tool Chaos Monkey in 2011, Netflix became something of a leader in chaos testing. Netflix has applications like Chaos Monkey running on their systems all the time to introduce breakdowns to the same production environment that actual Netflix users use. The creators of Chaos Monkey wanted to remind developers that breakdowns were inevitable and to force them to build resilience into their code. That’s still a key aim of chaos testing.

What is the goal of chaos testing?

Chaos engineering is used by global companies like Netflix, Amazon, and Google to increase system uptime. Chaos testing is especially important for applications where high uptime is required. That’s what the widely quoted “5-nines availability” refers to, i.e. a goal of 99.999% uptime for the system in question.
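To get a feeling for how strict such a target is, it helps to convert the percentage into a downtime budget. The short Python sketch below does that arithmetic, assuming a 365-day year; the function name allowed_downtime_seconds is our own. For 99.999% uptime, the budget works out to roughly 5.3 minutes of downtime per year.

```python
def allowed_downtime_seconds(availability_percent, period_days=365):
    """Downtime budget in seconds for a given availability target."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_seconds(target) / 60
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```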

Compared to performance or load testing, chaos testing takes things a step further as it has developers prepare for unanticipated, random events (rather than specified failure scenarios only). With chaos testing, preparing for these incidents is an obligation, since you will intentionally “unleash chaos” in your systems. Such failures therefore become facts of life rather than events that could occur but that you hope never will. That’s why proponents of chaos testing claim that practicing this technique helps make robustness a number one goal for developers.

Benefits of chaos testing

As spelled out above, chaos testing forces development teams to apply efficient coding practices that lead to high resiliency, and to adopt better solutions for monitoring application performance. Better systems reliability means less expensive downtime. High uptime, in turn, can improve user experience and increase customer loyalty.

For the organization, practicing chaos testing means that teams are better equipped to deal with failures, and the company will have better incident response and disaster recovery plans in place. It should also be easier to achieve compliance with regulatory requirements, industry standards, or service-level agreements. Overall, practicing chaos engineering promises to increase the confidence of all stakeholders in the system’s robustness. However, there are some challenges…

Challenges of chaos testing

When it’s done on production servers, chaos engineering has the potential to impact production and cause unnecessary damage if the “blast radius” for chaos testing is not controlled sufficiently.

Another problem is that without sufficient observability, any issues uncovered will be difficult to interpret and fix. Especially in the case of complex systems, it can be hard to understand interactions to discover how a fault occurs.

It’s worth noting that chaos testing is a resource-intensive process that necessitates the use of specialized tools and automation. Finally, it’s also important to remember that there can be a cultural difficulty in adopting chaos testing. Some teams may find it hard to accept that they have to intentionally cause disruption in a production environment, which goes against the instincts of anyone working with software systems.

How to do chaos testing: chaos engineering best practices

There are a few key best practices to consider when starting out with chaos testing.

To avoid unnecessary damage, pick the target system carefully and control your chaos testing activities accordingly. Remember that you want to stay in control of the chaos you’re unleashing on your production systems – careful planning is vital. Set up a well-defined hypothesis to guide your chaos testing efforts. Define the expected outcomes, and set a goal for the chaos testing experiment.
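One way to make this planning explicit is to write the experiment down as data before running anything. The following Python sketch is only an illustration under our own assumptions: the Experiment fields, the pick_targets and should_abort helpers, the instance names, and all threshold values are hypothetical, not part of any specific chaos engineering tool.

```python
import random
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    hypothesis: str          # what you expect to stay true during the fault
    max_error_rate: float    # threshold that backs the hypothesis
    blast_radius: float      # fraction of instances that may be affected (0..1)
    abort_error_rate: float  # stop the experiment if errors exceed this


def pick_targets(instances, blast_radius, rng):
    """Choose a random subset of instances, capped by the blast radius.

    The count is rounded down, so a very small radius may select nothing.
    """
    count = int(len(instances) * blast_radius)
    return rng.sample(instances, count)


def should_abort(experiment, observed_error_rate):
    """Safety check to run repeatedly while the fault is active."""
    return observed_error_rate > experiment.abort_error_rate


if __name__ == "__main__":
    experiment = Experiment(
        name="terminate-checkout-instances",
        hypothesis="Checkout error rate stays below 1% when 10% of instances die",
        max_error_rate=0.01,
        blast_radius=0.10,
        abort_error_rate=0.05,
    )
    instances = [f"checkout-{n}" for n in range(20)]  # hypothetical instance names
    targets = pick_targets(instances, experiment.blast_radius, random.Random())
    print(f"Would inject faults into {targets}")
    print(f"Abort at 6% errors? {should_abort(experiment, 0.06)}")
```

Keeping a separate, looser abort threshold next to the hypothesis threshold means the experiment can fail (the hypothesis did not hold) without being allowed to escalate into a real outage.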

It’s also crucial to ensure adequate visibility so that you can observe and understand any issues uncovered. Record data like performance metrics, error rates, response times, and whatever other parameters make sense in your environment to really dig deep into the problems that chaos testing calls your attention to.

All that data will come in handy when you’re setting out to document and analyze the results. Compare observed behavior with the expected results you defined in the beginning. Once improvements are made, document them accurately and share the knowledge internally so that other team members can benefit.
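The comparison of observed and expected behavior can be as simple as a small script. The Python sketch below assumes you recorded one success flag and one response time per request during the experiment; the Expectation class, the evaluate function, and the sample thresholds are hypothetical names and values chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Expectation:
    max_error_rate: float      # e.g. 0.01 means at most 1% failed requests
    max_p95_latency_ms: float  # limit for the 95th percentile response time


def error_rate(outcomes):
    """outcomes: list of booleans, True for a successful request."""
    if not outcomes:
        return 0.0
    return outcomes.count(False) / len(outcomes)


def percentile(values, pct):
    """Nearest-rank percentile; values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(expectation, outcomes, latencies_ms):
    """Compare the observed metrics with the expectation defined up front."""
    observed_errors = error_rate(outcomes)
    observed_p95 = percentile(latencies_ms, 95)
    return {
        "error_rate": observed_errors,
        "p95_latency_ms": observed_p95,
        "hypothesis_held": (
            observed_errors <= expectation.max_error_rate
            and observed_p95 <= expectation.max_p95_latency_ms
        ),
    }


if __name__ == "__main__":
    outcomes = [True] * 98 + [False] * 2      # made-up sample data
    latencies_ms = list(range(1, 101))        # made-up sample data
    print(evaluate(Expectation(0.05, 100), outcomes, latencies_ms))
```

A report like this is also a convenient artifact to attach when you document the experiment and share the findings with the rest of the team.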

Summary: chaos testing for software resilience & robustness

Performed with caution and the right level of planning, chaos testing is a powerful tool to ensure the robustness of the systems you are developing. By making chaos testing a principle in your team, you can encourage fellow developers to focus on resilience from the get-go.

Just like the focus on resilience, thorough testing should also start from the minute you start writing code. That’s why we’ve built Symflower, a handy IDE plugin that generates unit test templates for your Java, Spring, and Spring Boot applications. Using Symflower, you can save yourself the hassle and time it takes to write unit test boilerplates – better still, there’s a beta feature that generates complete unit test suites for your code! Automatic test maintenance and test-backed code diagnostics take things a step further to boost your productivity. Try Symflower free!

2023-10-31