A Guide to Software Resilience Testing!
Resilience testing is a form of “non-functional testing” that examines how an application behaves under stress. With consumer demands rising, resilience testing is more important than ever.
What is software resilience testing?
Software testing in general draws on many techniques and methodologies to check a product's functionality and performance and to uncover bugs.
Resilience testing, in particular, is a crucial step in ensuring applications perform well in real-life conditions. It belongs to the non-functional branch of software testing, which also includes compliance testing, endurance testing, load testing, recovery testing, and others. As the term indicates, resilience in software describes its ability to withstand stress and other challenging factors while continuing to perform its core functions and avoiding loss of data.
Since no software can be guaranteed to be 100% free of failures, you should build recovery from disruptions into it. With fail-safe mechanisms in place, it is possible to largely avoid data loss when a crash occurs and to restore the application to its last working state with minimal impact on the user.
One way to improve the resilience of software is to host it on cloud servers, which reduces exposure to failures of in-house systems in favor of a typically more resilient cloud architecture. Disruptions do occur at the cloud level as well, but cloud operators usually have sophisticated resilience and recovery systems in place.
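One common way to make "restore to the last working state" concrete is checkpointing: the application periodically saves its state and reloads the most recent good copy after a crash. The following is a minimal sketch of that idea, not a prescribed implementation. The function names `save_checkpoint` and `load_checkpoint` and the JSON file format are assumptions chosen for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically (hypothetical helper).

    The data goes to a temporary file first and is then renamed over the
    target, so a crash mid-write leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic swap to the new checkpoint
    except BaseException:
        # Clean up the partial temporary file, then re-raise the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or default if none is usable."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default
```

On startup, an application would call `load_checkpoint` with a sensible default and continue from whatever state it gets back. How often to save is a trade-off between write overhead and how much work may be lost in a crash.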
What are some examples of how software resilience testing is done?
A well-known example comes from Netflix. Even though all Netflix services are hosted on Amazon Web Services’ state-of-the-art cloud servers with cutting-edge hardware, the company realized that the sheer scale of its operations makes failures unavoidable.
To prepare for these failures, Netflix developed its own tool, Chaos Monkey, which creates random disruptions in the system so that its resilience can be tested. The tool was designed to simulate “unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables.” Having observed these failures, Netflix can then build automated recovery mechanisms to deal with them should they occur again in the future.
The tool is run while Netflix continues to operate its services, but in a controlled manner and at carefully chosen times. By running Chaos Monkey only during US business hours on weekdays, the company ensures that its engineers have the most capacity to deal with the disruptions and that server load is low compared to peak consumer usage times.
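The principle of randomly disabling instances and checking that the system recovers can be illustrated with a toy simulation. The sketch below is not Netflix's tool and touches no real infrastructure. The `Cluster` and `Instance` classes, the `reconcile` method, and the `run_experiment` function are hypothetical names that model an automated recovery mechanism entirely in memory.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    """Toy stand-in for a pool of servers with automated recovery."""
    desired_size: int
    instances: list = field(default_factory=list)
    launched: int = 0

    def reconcile(self):
        """Drop failed instances and launch replacements."""
        self.instances = [i for i in self.instances if i.healthy]
        while len(self.instances) < self.desired_size:
            self.launched += 1
            self.instances.append(Instance(f"node-{self.launched}"))


def kill_random_instance(cluster, rng):
    """Mark one randomly chosen healthy instance as failed."""
    healthy = [i for i in cluster.instances if i.healthy]
    if not healthy:
        return None
    victim = rng.choice(healthy)
    victim.healthy = False
    return victim


def run_experiment(cluster, rounds, seed=0):
    """Inject one failure per round; report whether recovery always worked."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    cluster.reconcile()  # bring the cluster up to its desired size
    for _ in range(rounds):
        kill_random_instance(cluster, rng)
        cluster.reconcile()
        healthy = sum(1 for i in cluster.instances if i.healthy)
        if healthy != cluster.desired_size:
            return False
    return True
```

Calling `run_experiment(Cluster(desired_size=3), rounds=5)` injects five failures and returns whether the cluster was back at full strength after each one. In a real system, the recovery step is the part under test, and a failed check points to a gap in the automated recovery.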
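The scheduling rule described above can be expressed as a simple guard that a failure-injection job checks before doing anything. This is a sketch under stated assumptions: the function name `in_chaos_window` is hypothetical, and the 9:00 to 17:00 window is an assumed definition of business hours, since the exact hours are not given here. The timestamp is expected to already be in the engineering team's local time.

```python
from datetime import datetime

# Assumed business hours; adjust to match the team's actual working day.
WINDOW_START_HOUR = 9
WINDOW_END_HOUR = 17


def in_chaos_window(now):
    """Return True if failure injection is allowed at the given local time.

    Allowed only Monday to Friday, from WINDOW_START_HOUR (inclusive)
    to WINDOW_END_HOUR (exclusive).
    """
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = WINDOW_START_HOUR <= now.hour < WINDOW_END_HOUR
    return is_weekday and in_hours
```

A job would typically call `in_chaos_window(datetime.now())` and exit early when it returns `False`, so that no disruption starts when few engineers are available to respond.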
After early successes, Netflix quickly developed additional tools to test other kinds of failures and conditions. Among these tools were Latency Monkey, Conformity Monkey, Doctor Monkey, and others, collectively known as the Netflix Simian Army. Resilience testing with the Simian Army has since become a popular approach for many companies.
With consumer expectations increasing, it is vital to ensure minimal disruptions to any service or software that enters the market these days. While cloud hosting can go a long way in minimizing failures, resilience testing should still make up a significant part of overall software testing.