It’s amazing. It’s brand new. Everyone needs it. It is “the next best thing”.
Only half of those statements are true: it is not new, and it is not the next best thing. It is a new brand for an existing paradigm that many companies already use and rely on. But it is amazing, and everyone should want it. If you're new to the concept, what does observability really mean, and how can it help transform your operations?
Investing in observability is about creating insight.
Collecting data from logs is great. Monitoring your resources is also very handy. But these items only give you one-dimensional fragments of a complete picture. They are words or sentences in the chapters of the story that is your environment. Once you assemble enough fragments and organize them so you can glean actionable knowledge of your environment, you are creating that insight. Observability is about intersecting multiple items to generate a deep understanding of the true health of your environment, the real issues, and what should be changed to improve it.
Let’s dig into some of these components.
What Makes Observability?
Most will already know what each of these items (hereafter referred to as tenets) is, so let's focus on what they mean for observability and what you should be thinking about to reach a higher state of observability.
Monitoring

This tenet defines “what” you want to watch. It is easy to overlook monitoring helpful items, and even easier to overdo it by monitoring everything (more on this later). When building monitoring into each part of your operation, answer these questions:
- What should be watched? “Only critical items” is the wrong answer. So is “everything.”
- What tools can help simplify the process?
- How should you aggregate the data for simple processing?
- How long should data be kept? Time series data is helpful, but too much data can be wasteful and unnecessary. Not all data needs to be kept all the time.
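On the retention question, one common approach is to keep recent data at full resolution and downsample older data into coarser buckets. Here is a minimal sketch in Python; the `(timestamp, value)` point format and the bucket size are assumptions for illustration, not tied to any particular metrics store:

```python
from datetime import datetime
from statistics import mean

def downsample(points, bucket_seconds):
    """Collapse (timestamp, value) points into per-bucket averages.

    Older time-series data rarely needs full resolution; averaging into
    larger buckets keeps the trend while shedding storage cost.
    """
    buckets = {}
    for ts, value in points:
        key = int(ts.timestamp()) // bucket_seconds
        buckets.setdefault(key, []).append(value)
    # Return one averaged point per bucket, in chronological order.
    return {
        datetime.fromtimestamp(key * bucket_seconds): mean(values)
        for key, values in sorted(buckets.items())
    }
```

A retention job might keep raw points for a week, then replace anything older with, say, 5-minute averages produced by a pass like this.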
Logging

This tenet is simple to implement but difficult to do well. It defines not only where your servers and services place their execution and debugging information, but also what you want to log, how you want it logged, and, in most cases, how the logs can be transformed as they are shipped to an aggregation and/or search system.
Like monitoring, logging everything at the debug level is wasteful and will create massive false positives, unimportant alarms, and more difficulty sifting through unimportant data. Not to mention, it will become expensive to manage.
Try answering these questions:
- Where should all logs ship to? Any log that contains important data should not stay on a server where it can be lost (and not analyzed).
- How much logging is appropriate for each service? Debug shouldn’t be the default unless you’re troubleshooting, or if important data is needed that only comes out when an application is set to debug. Those items should be reclassified.
- How long should logs be retained locally and in your shipped location?
- What tools can simplify this process? (For example, the ELK Stack.)
- Do logs need to be transformed to be more useful and ingestible?
- Are logs being populated with the correct data? In other words, custom applications should record meaningful information about the state of the application along with detailed errors. This puts a heavy onus on application developers to invest time in extensive error handling.
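Logs become far easier to transform and ingest when applications emit structured output in the first place. Below is a minimal sketch using Python's standard `logging` module to emit one JSON object per line, which shippers like Logstash can parse directly; the `order-api` service name and `order_id` field are hypothetical:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line for easy ingestion downstream."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "service": "order-api",  # hypothetical service name
        }
        # Attach structured context passed via the `extra=` argument.
        if hasattr(record, "order_id"):
            payload["order_id"] = record.order_id
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("order-api")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Meaningful state, not free text: searchable fields instead of prose.
log.info("order placed", extra={"order_id": "A-123"})
```

Because each line is already a well-formed document, the shipping pipeline needs little or no transformation before indexing.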
Tracing

This tenet defines what either happened or what is happening as a request moves through your systems. It is an often overlooked but very important piece of proactive observability. Think through these items:
- Which failures in a service are critical enough to catch and alert on immediately?
- What code can be added to a service to provide better insight into execution behavior (debugging and anomaly hunting)?
- Is there visibility end to end for transactions? If not, fix that immediately.
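End-to-end visibility usually comes from a real tracer such as OpenTelemetry or Zipkin, but the core idea, timing named spans that share a trace ID across calls, can be sketched in a few lines of Python. The span names and the in-memory `spans` list here are purely illustrative:

```python
import time
import uuid
from contextlib import contextmanager

@contextmanager
def traced_span(name, trace_id=None, spans=None):
    """Record a named span's duration under a shared trace ID.

    A toy stand-in for a real tracer: the same trace_id is carried
    across service boundaries so one transaction can be followed
    end to end.
    """
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()
    try:
        yield trace_id
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        if spans is not None:
            spans.append({"trace": trace_id, "span": name, "ms": elapsed_ms})

# Usage: nest spans under one trace ID for end-to-end visibility.
spans = []
with traced_span("checkout", spans=spans) as tid:
    with traced_span("charge-card", trace_id=tid, spans=spans):
        pass  # the work being timed would go here
```

In a real system the spans would be exported to a tracing backend instead of a list, but the propagation pattern is the same.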
Analytics

This tenet is more about cross-sectioning multiple other tenets to create a deeper understanding of what is being captured, and it is often the heart of observability. Tools can assist you here; make sure to answer these questions:
- What do I want to see from my data and what is it telling me? Graphs are great, actionable plans from data are better.
- Are graphs showing trends? Even better, are they highlighting what is abnormal?
- What should be adjusted in your environment based on the analytics?
- Are applications experiencing bottlenecks?
- Is scaling not responsive enough?
- Is data too incomplete to make the correct decision?
- Is the data supporting bottom lines for budget, operations, customers, the business? It should.
Alerting

This tenet defines “how” to notify and “who” should be notified when an actionable event occurs. It is the easiest to get wrong and so important to get right. Make sure you cover:
- Do all of your alerts require attention? Remove the ones that don't.
- Is there automation to resolve alerts? If not, the alerts should become tickets for DevOps to automate away the problem.
- Are alerts being tracked (via analytics) for trending? They should be. Trend data can easily justify DevOps automating a solution versus the hours spent resolving repeat alerts.
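Tracking repeat alerts can be as simple as counting alert fingerprints and surfacing the ones that cross a threshold. A sketch, where the threshold of 5 and the fingerprint strings are arbitrary illustrations:

```python
from collections import Counter

class AlertTracker:
    """Count repeat alerts so recurring ones can justify automation work."""

    def __init__(self, automation_threshold=5):
        self.counts = Counter()
        self.automation_threshold = automation_threshold

    def record(self, fingerprint):
        """Log one occurrence of an alert, keyed by a stable fingerprint."""
        self.counts[fingerprint] += 1

    def automation_candidates(self):
        """Alerts seen often enough that automating the fix beats
        resolving them by hand every time."""
        return [fp for fp, n in self.counts.items()
                if n >= self.automation_threshold]
```

The candidate list is exactly the evidence needed to justify a DevOps ticket: "this alert fired N times; automate it."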
Implementing Actionable and Proactive Observability
It is easy to make these common mistakes:
- Monitor everything
- Alert on everything
- Store all logs and all data
- Use default graphs
What these mistakes lead to are inboxes full of ignored (and overwhelming) alert notifications. They overload logging systems, which become increasingly difficult to filter and sort for helpful information. They become very expensive to manage and maintain, especially with hosted solutions that often charge by data uploaded and stored. Worst of all, these common and often implemented mistakes cause two grievous issues.
Zero observability. What you have is a massive amount of completely unusable data.
Alert complacency. When your folks see 1000 unresolved alerts, they are more likely to just “acknowledge all” than chase down each alert. Especially if they see it every day.
What Can You Do to Avoid These Mistakes?
An important first step is to determine what kinds of information, metrics, and performance insight you want, and start building to meet those needs. This doesn't just mean ingesting logs and creating heatmap graphs; it also means adding code to your applications so they can provide additional, insightful data to accompany heatmaps, alerts, and so on. It also means running automated end-to-end tests frequently, and before and after deployments, to understand whether anything unexpected has changed. It means iterating on the importance of good application behavior and improving each of the tenets as issues occur.
Second, and this cannot be stressed enough, condense the nonsense in your alerting. If an alert isn't actionable, get rid of it. When an alert does fire, determine whether it provides good information: does it identify the exact issue and potential fixes, and is it being tracked and analyzed? Most importantly, can the issue that caused the alert be fixed, and then automated? Lastly, when an issue occurs in your environment that didn't cause an alert and can't be automated away, an alert should be created.
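The triage rule described above, drop the non-actionable, automate the automatable, and ticket the repeat offenders, can be sketched as a small routing function. The alert shape, the runbook mapping, and the repeat threshold of 3 are all illustrative assumptions:

```python
def triage(alert, runbooks, repeat_counts):
    """Route an alert under the 'condense the nonsense' rule.

    `runbooks` maps alert names to an automated fix callable (if one
    exists); `repeat_counts` maps alert names to how often they have
    fired before. All names and thresholds are illustrative.
    """
    if not alert.get("actionable"):
        return "drop"  # non-actionable alerts get removed entirely
    fix = runbooks.get(alert["name"])
    if fix is not None:
        fix(alert)  # automation resolves it without paging anyone
        return "auto-resolved"
    if repeat_counts.get(alert["name"], 0) >= 3:
        return "ticket"  # repeat offender: file work to automate the fix
    return "page"  # genuinely new and actionable: notify a human
```

The point is that "page a human" becomes the last resort rather than the default, which is what keeps inboxes honest.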
Third, observability is not just an ops problem. For example, if an application barely logs data, produces a ton of noise, or emits error codes that are difficult to understand, you have a garbage-in, garbage-out scenario. Making this mistake can cause extended outages while ops teams try to sort through the garbage. Development and operations both have to contribute to observability through the tenets; otherwise, you will achieve limited success and will likely end up with incomplete answers during a disaster. This is why observability is such a core discipline in DevOps.
Real World Scenarios
MongoDB and indexing, with ELK: If you are running MongoDB Ops Manager, you have a leg up in MongoDB observability, but for the sake of explanation, let's discuss indexing. Say we are pulling the underlying operating system metrics (CPU, memory, disk I/O, network traffic) for the MongoDB servers. This gives us general system health monitoring but nothing on MongoDB performance. One of the metrics could be pegging (such as CPU). This means MongoDB performance will suffer, but why is it suffering?
Next, we look at the graphs in our monitoring system and find that CPU tends to spike between 3 and 4 p.m. each day. Okay, that's better; it means it's (likely) not an anomaly. We then move over to Kibana and run queries to observe Mongo activity (since we knew to pull the MongoDB logs using Logstash). We notice some queries are taking a very long time to complete while more of them keep arriving, causing a cascading resource problem. As we find the slowest queries, we notice the number of objects scanned versus returned is completely off; this must be a query that doesn't hit an index.
What did we get from this? First, it's time to build an index to support the offending call. This is a reactive but necessary action, and it is also where most people stop. Let's go further: next, we create a dashboard tracking Mongo objects scanned versus returned, then set a reasonable and acceptable threshold that will alert (such as a scanned-to-returned ratio above 50), with the alert containing the collected metadata, the query itself, and the Mongo collection. That way, when the next offense comes in, usable insight comes through as “Unindexed query has happened 15 times in 4 minutes, returning 3,000 objects after 200,000 scanned on collection UserOrders. Suggest creating an index on EstDeliveryTime and id_Customer.” This is a much better alert than “CPU high on MongoDB,” and a proactive step toward potentially automating index creation.
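A sketch of that proactive check, assuming slow-query stats have already been parsed out of the MongoDB logs. The field names loosely mirror Mongo's `docsExamined`/`nreturned` slow-query fields, and the suggestion text, collection names, and ratio threshold are illustrative:

```python
def index_alerts(query_stats, ratio_threshold=50):
    """Flag likely unindexed queries from parsed slow-query stats.

    A high scanned-to-returned ratio is the classic signature of a
    query that misses every index. Input field names mimic MongoDB's
    slow query log; the alert text is an illustration.
    """
    alerts = []
    for q in query_stats:
        if q["nreturned"] == 0:
            continue  # avoid dividing by zero on empty result sets
        ratio = q["docsExamined"] / q["nreturned"]
        if ratio > ratio_threshold:
            alerts.append(
                f"Likely unindexed query on {q['collection']}: "
                f"{q['docsExamined']} scanned for {q['nreturned']} returned "
                f"(ratio {ratio:.0f}). Consider an index on {q['filter_fields']}."
            )
    return alerts
```

Wired into the dashboard's alerting, this turns a raw CPU spike into a named collection, a named query, and a suggested fix.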
Latency heatmap with auto scaling: We have an ASG (auto scaling group) with the default “high CPU” scaling metric, but is CPU always the bottleneck? And does adding more servers to an auto scaling group create new upstream bottlenecks? It certainly can.
It’s just before Black Friday, and one of your services is a server hosting a Rails application behind Nginx. It is starting to peg the CPU because of runaway Unicorn processes. Unicorn memory bloat is clogging the CPU, which forces the ASG to scale. The scaling was set high in anticipation of Black Friday, so it launches 50 servers at a time. Now 50 servers check in to your configuration tool (like Chef or Puppet) and begin pulling configurations, but at that load, it is slow. The new configurations take 8 minutes to complete before the servers come online in HAProxy, instead of 3. This causes additional scaling, which aggravates the problem. Meanwhile, HAProxy is bursting in load and begins to bottleneck on CPU. Without continuing, it’s easy to see where this is going.
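That cascade can even be roughed out numerically. This toy simulation (all numbers illustrative) shows how a slower configuration run causes the ASG to fire more scale-out events and over-launch servers before capacity catches up:

```python
def scale_events(demand_servers, batch_size, config_minutes, scale_check_minutes=2):
    """Count scale-out events and total servers launched before demand is met.

    While launched servers are still running configuration, the ASG keeps
    seeing a capacity shortfall and scales again, so slower configuration
    means more over-scaling. A toy model; all parameters are illustrative.
    """
    online, launched, events, clock = 0, 0, 0, 0
    pending = []  # (ready_at_minute, server_count) for instances still configuring
    while online < demand_servers:
        # Servers whose configuration run has finished come online.
        online += sum(n for ready, n in pending if ready <= clock)
        pending = [(ready, n) for ready, n in pending if ready > clock]
        if online < demand_servers:
            # Shortfall still visible: launch another batch.
            pending.append((clock + config_minutes, batch_size))
            launched += batch_size
            events += 1
        clock += scale_check_minutes
    return events, launched
```

With a 3-minute configuration run, meeting a demand of 100 servers in 50-server batches takes 3 scale events (150 launched); at 8 minutes, the same demand triggers 5 events and 250 launched servers, which is exactly the over-scaling spiral described above.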
This time it’s a few weeks before Black Friday. We proactively create performance heatmaps to monitor resources and increase logging for a trial run. We simulate the aforementioned surge and observe performance. As we see our heatmaps going red, we dig in to understand what is pegging and determine whether an actual problem is causing the heat (runaway Unicorns) or the application is just running hot, then determine the correct way to scale. As scaling occurs, the bottleneck often moves further down the pipe, either toward the data (the back) or toward the endpoint (the front), and the exercise continues.
Now that you’ve resolved application issues and know your scaling is correct, your heatmaps are available to correctly determine load and your alerting can provide qualified notifications.
As you can see, observability isn’t just a hot buzzword; it is a way to mindfully gain insight from your environment by getting the most out of the underlying tenets. It is about shedding the noise and building applications and services that provide targeted visibility. It is about analyzing the information provided, turning that data into system improvements, and honing your visual aids to tell you what is happening rather than managing a system where you hunt for the answers. Observability is about designing your environment to provide actionable insight through meaningful data.