The online world is full of contrasts. On the one hand, you have site reliability engineers whose job is to keep the business running by ensuring an app’s smooth operations. On the other hand, you have the DevOps staff, whose goal is to minimize cycle time—the time from business idea to feature in production. These two teams can have conflicting objectives.
These teams work together smoothly as long as cycle time is minimized. But problems tend to arise when outages happen, and teams often live in fear of their occurrence. After an outage, when the dust settles, you can do a postmortem and call it a marvelous learning opportunity; however, this won’t change the fact that there was significant downtime and your cycle time went off the charts. In the words of The Phoenix Project, “The only thing that stops planned work is unplanned work.”
No one knows what the next incident will be until it happens. Firefighters often intentionally start controlled fires to ensure that real fires can be contained. You should do this too, by taking proactive measures to ensure that your system can handle faults, manage underlying infrastructure breakdowns (like hardware malfunctions), and gracefully work around service outages. This craft is called Chaos Engineering, and this post is dedicated to presenting some of its main principles.
The goal of chaos engineering is to create an antifragile system—one that thrives upon fault. An antifragile system doesn’t simply resist malfunction and remain unchanged in harsh conditions; it improves with each failure event. The question is, “How can you achieve this condition when code, once written, is static, packaged, and deployed to an immutable infrastructure?” Can such a rigid structure adapt? Yes, it can—if you consider the system as a symbiotic relationship between a man and a machine. Engineers can draw conclusions from failures and implement remedies.
Should I Break Production?
Does chaos engineering mean that every SRE professional should get trigger-happy around a cloud provider’s console and keep pushing the “terminate” button? Absolutely not. “Not breaking production” remains a key phrase in the SRE job description.
Saying, “I am going to deliberately inject fault into our system,” will definitely capture the attention of your business. Without a proper explanation for the reasoning behind your initiative, your attempts to improve the product will grind to a halt.
Chaos experiments must be carefully explained and planned, and they must be diligently executed. Randomly deleting cloud resources without adequate preparation won’t provide meaningful insights into the system, and it might cause an unplanned outage. You’ll want to use the proper tools and automation in your experiments. They will help you reproduce your experiments, should the need to do so arise.
The Prerequisites for Chaos Engineering
Without proper monitoring, you cannot fully understand the effects of the fault you’ve injected into the system. And without a certain level of system maturity and basic High Availability (HA) in place, any fault injection can bring the whole system down. Martin Fowler wrote a great bliki entry about the supporting infrastructure for a microservices-based system that is worth checking out.
There’s no point in doing chaos engineering if your system doesn’t have redundancy or failover mechanisms. Without even touching the command line, you know that their absence is a weak link. A common practice among startups is adding high availability without even considering the associated tradeoffs (such as eventual consistency or conflict resolution with regard to concurrent database writes). If application developers are not taking into account the possibility of such events or the possibility of component failure, these issues should be addressed first. A frequently overlooked component of a system is the network that connects all the other components together. Remember, it’s also prone to failure!
On the observability side, logging will allow root cause analysis. Tracing will point out which services may require additional attention, whereas extensive monitoring can provide a general overview of the system health and alert you in case problems arise. You will know which pieces are missing from your infrastructure when chaos ensues—hopefully they’re the ones you targeted.
Four Golden Signals
“Site Reliability Engineering” by Google mentions that an excellent starting point when developing a monitoring system is tracking latency, traffic, errors, and saturation. By closely watching these metrics, you can observe most of the effects of your experiments. You can search for correlations that allow you to better understand how components interact. For example, if you start emitting errors from one service and the entire system latency goes through the roof, then you might need to be more conservative with regard to timeouts, or you might consider implementing some circuit breakers.
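To make the circuit breaker idea a bit more concrete, here is a minimal sketch in Python. It is not taken from any particular library; the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions made for illustration. After a number of consecutive failures, the breaker stops calling the unhealthy service and fails fast until a cool-down period has passed.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so tests need no sleeping
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast, downstream is unhealthy")
            # Cool-down elapsed: allow a trial call (half-open state).
            # A single failure from here re-opens the breaker.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the breaker again
        return result
```

You would wrap each call to a flaky dependency, for example `breaker.call(fetch_prices, product_id)`, where `fetch_prices` stands in for your own client function. The design choice worth noting is that callers get an immediate `CircuitOpenError` instead of waiting on a timeout, which keeps one failing service from dragging the latency of the whole system up.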
On the other hand, you can tell if your system is scalable by introducing load and carefully watching saturation and errors. If the saturation reaches a critical value and then drops without so much as a spike on the errors chart, you know that autoscaling was triggered and that the additional load was distributed across the new instances. Of course, if the saturation stays at peak level and the error rate rises, it could be an indication that the system isn’t scaling fast enough.
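The reasoning above can be turned into a small check that you run against the metrics collected during a load test. The sketch below assumes you have already exported saturation and error rate as two lists of samples; the function name and the threshold defaults (`critical_saturation`, `max_error_rate`) are made up for illustration, so pick values that match your own system.

```python
from typing import Sequence


def classify_load_test(saturation: Sequence[float],
                       error_rate: Sequence[float],
                       critical_saturation: float = 0.8,
                       max_error_rate: float = 0.01) -> str:
    """Interpret a load test from saturation and error-rate samples."""
    if not saturation or not error_rate:
        raise ValueError("need at least one sample of each metric")
    if max(saturation) < critical_saturation:
        return "inconclusive: load never reached critical saturation"
    # Did saturation come back down by the end of the test?
    recovered = saturation[-1] < critical_saturation
    # Did the error rate stay below the tolerated level throughout?
    errors_ok = max(error_rate) <= max_error_rate
    if recovered and errors_ok:
        return "scaled"
    if not recovered and not errors_ok:
        return "not scaling fast enough"
    return "needs investigation"


print(classify_load_test([0.4, 0.7, 0.9, 0.6, 0.5], [0.0, 0.0, 0.001, 0.0, 0.0]))
print(classify_load_test([0.4, 0.9, 0.95, 0.95], [0.0, 0.02, 0.05, 0.08]))
```

The first call describes a run where saturation peaked and then dropped with a clean error chart, and the second one a run where saturation stayed high while errors climbed. Mixed results (for example, saturation dropped but errors spiked) are deliberately reported as "needs investigation" rather than guessed at.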
Let’s Get Down to Business
The ability to detect complex failures requires far more advanced monitoring that is based on measurable business goals. This goes beyond the basics—key services’ latency may be exceptional, but it only matters if there are actual people buying your product.
Humans make mistakes and tend to miscommunicate. App requirements change or aren’t clear from the start. As a result, there is potential for convoluted bugs in every layer of your system. Imagine that, after running an experiment, the system appears to be fine, but the number of orders plummets. This could be caused by interference between a legacy service and a brand new one that prevents the client from completing the checkout, for example. Tracking business objectives will help you catch this early and react accordingly.
Unless you do proper shadow deployments, you’ll be running your chaos engineering experiments in production. While shadow deployments are a perfectly fine thing to do, they can get pricey and complex at a certain scale, and they provide little added value in an early stage startup.
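A business-level guardrail for the plummeting orders scenario could look like the sketch below. It assumes you can pull orders per minute from your metrics store for a baseline window and for the experiment window; the function name and the 20% default for `max_drop` are illustrative assumptions, not recommendations.

```python
from statistics import mean
from typing import Sequence


def orders_dropped(baseline_per_minute: Sequence[float],
                   experiment_per_minute: Sequence[float],
                   max_drop: float = 0.2) -> bool:
    """Return True if the order rate fell by more than max_drop (a fraction)."""
    baseline = mean(baseline_per_minute)
    current = mean(experiment_per_minute)
    if baseline == 0:
        # No orders before the experiment, so there is nothing to compare to.
        return False
    return (baseline - current) / baseline > max_drop


# Technical metrics may look fine while this check still fires.
print(orders_dropped([100, 110, 90], [40, 50, 45]))
print(orders_dropped([100, 110, 90], [95, 100, 99]))
```

In the first call the average order rate falls from 100 to 45 per minute, which is far beyond the tolerated drop, so the experiment should be stopped and investigated. In the second call the rate barely moves.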
Making production your playground is an acceptable practice as long as there is a rollback plan in place and the fault injection can be controlled and contained if necessary. During chaos engineering adoption, you may even consider having people on standby during the experiments. You should also consult business stakeholders to make sure the time window in which you conduct your experiments doesn’t conflict with an important demo or other event. In addition, you should be extra careful with databases. If you persist a malicious payload and the cleaning scripts fail silently, it may be very difficult for your system to successfully recover. Last, but not least, you should keep an eye on security at all times. Introducing vulnerabilities with the experiments is not a good idea.
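One way to keep an experiment controlled and contained is to describe it as data and let a small runner enforce the rules. The following is a sketch under stated assumptions: `Experiment`, `run_experiment`, and the field names are hypothetical, and the callbacks stand in for whatever tooling you use to check health, inject the fault, and undo it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    steady_state: Callable[[], bool]  # True while the system looks healthy
    inject: Callable[[], None]        # starts the fault
    rollback: Callable[[], None]      # undoes the fault


def run_experiment(exp: Experiment, checks: int = 5) -> str:
    """Inject a fault, watch the steady state, and always roll back."""
    if not exp.steady_state():
        # Never add chaos to a system that is already struggling.
        return "skipped: system unhealthy before injection"
    exp.inject()
    try:
        # A real runner would wait between checks; omitted to keep this short.
        for _ in range(checks):
            if not exp.steady_state():
                return "aborted: steady state violated"
        return "passed"
    finally:
        # Runs on success, on abort, and on unexpected exceptions alike.
        exp.rollback()
```

Putting the rollback in a `finally` clause means the fault is removed no matter how the experiment ends, which is exactly the property you want when production is the playground.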
Monkeys in Your Datacenter
We’ve explored the “why” and “how” of chaos engineering. Now it’s time to examine the “what” by looking at some reliable tools to aid you with your experiments.
The most well known of these is Chaos Monkey from Netflix, which randomly terminates AWS instances, along with its “big brother,” Chaos Gorilla, which simulates the outage of an entire availability zone. While these are powerful tools, please bear in mind that your experiments should represent a credible threat to your system. Do you need to be able to operate flawlessly when an entire region goes out?
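If you want a feel for the random termination idea before adopting a full tool, the selection step can be sketched in a few lines. This is not how Chaos Monkey is implemented; `pick_victim` and the shape of the `groups` mapping are assumptions for illustration. The sketch only considers groups that have more than one instance, and it takes a seeded random generator so that a run can be reproduced later.

```python
import random
from typing import Dict, List, Optional


def pick_victim(groups: Dict[str, List[str]],
                rng: random.Random) -> Optional[str]:
    """Pick one instance to terminate, skipping groups without redundancy."""
    candidates = [
        instance
        for instances in groups.values()
        if len(instances) > 1  # a lone instance has no failover to test
        for instance in instances
    ]
    if not candidates:
        return None
    return rng.choice(candidates)


groups = {"web": ["web-1", "web-2"], "db": ["db-1"]}
victim = pick_victim(groups, random.Random(42))  # same seed, same choice
print(victim)
```

The function only returns a name. Actually terminating the instance is left to your cloud provider's tooling, and it is a good idea to start with a dry run that merely logs the choice.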
Commonly used tools like Jaeger can be repurposed for chaos experiments. You can use the span baggage to pass fault injection instructions across the service mesh. To achieve ultimate reusability, you can try using pieces of your system against itself. How will a custom queue-based system behave if an additional (potentially misconfigured) scheduler starts competing for the workers?
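The baggage approach can be illustrated without any tracing library at all. In the sketch below, a plain dictionary plays the role of the span baggage that travels with a request, and each service checks it before doing its work. The key name `chaos-fault`, the `service:action` format, and the two toy services are all hypothetical; in a real setup the tracing client would carry the baggage between services for you.

```python
from typing import Dict


class InjectedFault(Exception):
    """Raised when a service is told to fail via baggage."""


FAULT_KEY = "chaos-fault"  # hypothetical baggage key, value "service:action"


def maybe_inject_fault(service_name: str, baggage: Dict[str, str]) -> None:
    """Fail on purpose if the baggage targets this service."""
    instruction = baggage.get(FAULT_KEY, "")
    target, _, action = instruction.partition(":")
    if target == service_name and action == "error":
        raise InjectedFault(f"{service_name}: fault requested via baggage")


def checkout(baggage: Dict[str, str]) -> str:
    maybe_inject_fault("checkout", baggage)
    return payment(baggage)  # baggage travels with the downstream call


def payment(baggage: Dict[str, str]) -> str:
    maybe_inject_fault("payment", baggage)
    return "paid"


print(checkout({}))  # no instruction, so the request succeeds
try:
    checkout({FAULT_KEY: "payment:error"})
except InjectedFault as exc:
    print(exc)
```

The appeal of this design is that the fault is scoped to a single request. Only traffic that carries the instruction is affected, so regular users never see the failure, and you can aim the fault at a service deep in the call chain from the very edge of the system.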
Having read this article, you should have a good understanding of the benefits and challenges of chaos engineering. Do you still think you need it? If so, are you ready for it? If you answered “yes” to both of those questions, check out the tools described above or try writing your own. You are the one who knows best what you want to protect against!