Monitoring is one of the most important practices in the DevOps space. Many tools can help an organization complete its monitoring picture, but no single tool does everything, so most organizations combine several. Mashing tools together often creates a problem of its own: tool sprawl.
In modern computing, what matters most is not how much data you collect and report, or how efficient or durable your monitoring solution is. Those are all important considerations, but what makes the difference is how effective and useful your monitoring is: how much value it creates for the business, and how well the data can be used to identify and resolve critical issues. Monitoring is never a completed effort.
It evolves. It is enhanced by tools and by integrations. Often enough, the journey to improve monitoring is what creates and accentuates the tool sprawl problem. In this article, I’d like to examine how monitoring tool sprawl can become a serious issue for modern, engineering-driven companies.
The task of monitoring modern IT environments is too complex to properly handle without tools. The days of allowing logs to sit on servers and fishing through them to find answers are long gone. Alerting on an operating system issue and manually clearing out all the noise from old vendor solutions for sysadmins (think HP, Dell, IBM) no longer scales in the world of cloud computing.
Luckily, there are plenty of modern tools to solve modern issues. But like any type of software, every monitoring tool has strengths and weaknesses of its own. Organizations will often patch together multiple monitoring tools based on their strengths and just deal with the sprawl.
So what are the modern problems to solve and tools to solve them?
Log data is considered an extremely valuable data source for monitoring and troubleshooting both applications and the infrastructure they are installed on. Most log management tools on the market provide analysis capabilities. Some provide advanced analytics such as machine learning and anomaly detection. Most of these tools now include plugins and integrations with cloud vendors to provide greater insight into cloud-based applications.
One of the most widely used open source log management tools is the ELK Stack, an extremely popular and powerful platform, but one that often requires significant engineering effort and expertise to scale.
Metrics, or time-series data, are another type of telemetry data used for monitoring. Used primarily for APM (Application Performance Monitoring), ITIM (IT Infrastructure Monitoring) and NPM (Network Performance Monitoring), metrics introduce another kind of challenge because they are more verbose in nature and require more elaborate data storage and retention strategies as well as analysis features.
Open source solutions are often comprised of a time series database such as Prometheus, InfluxDB or Graphite with Grafana playing the role of the analysis and visualization layer. Plenty of SaaS vendors offer their own APM and monitoring solutions, including premade dashboards for monitoring specific services or platforms.
The increase in cyber threats means organizations must operate with security in mind. A big part of security is active monitoring and reactive controls. Examples include triggering an alarm on a root or administrator login, or automatically signaling a Puppet run in response to a security incident in which a security-controlled configuration was changed. Building this kind of solution requires a very specific kind of tool, usually falling under the category of SIEM or Security Analytics. Again, there are both open source and proprietary solutions on the market, but the skills gap is proving to be as big a challenge as integrating and deploying these solutions.
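As a rough illustration of the root login alarm mentioned above, the sketch below scans log lines for successful root logins. It assumes an OpenSSH-style auth log line; real log formats vary by distribution and configuration, and the function name `find_root_logins` and the sample lines are made up for this example. A SIEM would do this with its own rules engine rather than a script like this.

```python
import re

# Assumed OpenSSH-style line; adjust the pattern to your own log format.
ROOT_LOGIN = re.compile(
    r"sshd\[\d+\]: Accepted (?P<method>\S+) for (?P<user>root) from (?P<ip>\S+)"
)


def find_root_logins(lines):
    """Return one alert dict per line that records a successful root login."""
    alerts = []
    for line in lines:
        match = ROOT_LOGIN.search(line)
        if match:
            alerts.append(
                {"user": match["user"], "ip": match["ip"], "method": match["method"]}
            )
    return alerts


# Hypothetical sample data for illustration only.
sample = [
    "Jan 10 12:00:01 web1 sshd[411]: Accepted publickey for deploy from 10.0.0.5 port 50022 ssh2",
    "Jan 10 12:03:17 web1 sshd[412]: Accepted password for root from 203.0.113.9 port 41811 ssh2",
]
alerts = find_root_logins(sample)
```

Only the second sample line produces an alert, because the first login is for a non-root user.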
SOC, PCI, HIPAA, SOX, GDPR, ISO, and CODA are just a few regulatory and compliance certifications companies must contend with to remain in business. All of them require some level of auditable data to show that their required checks and controls are being maintained. This means companies must find tools to capture, store, and retrieve data for compliance. Some tools excel at configuring controls or capturing security data but aren’t as strong at capturing application logs and transforming them into formats that mesh well with security logs to produce a single, overlaid picture.
Again, most tools provide canned reports, and most also allow you to build your own. The key difference is that some providers’ reports will be more relevant to an organization than others. An example of where tool sprawl can become real is an organization with a security team that prefers the tailored security event reports from Alert Logic, an operations team that uses Datadog’s metrics for capacity planning, and developers who use the ELK Stack to determine API performance issues. All three tools can create all three reports, but none of them specializes in all three. This key difference is what creates a tool sprawl challenge, in this case for reporting and alerting.
Multiple solutions mean what?
After reading the previous section, it is easy to see how companies choose multiple tools and vendors to solve their monitoring needs. In the following section, I’d like to examine some of the issues that can result from having multiple monitoring solutions.
Multiple panes of glass
Having security data flow to one tool, systems performance data to another, and application data to a third makes correlation much more difficult. Even if you are able to have data sources feed multiple frontend tools, it still requires additional “stitching” to deliver the data in a meaningful way and the systems still present information differently. This can force the need to build translation jobs between solutions, or lengthy exports and manual correlation in spreadsheets. Nobody wants to do that.
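To make the "stitching" concrete, here is a minimal sketch of correlating events exported from two different tools. It assumes both exports have already been reduced to dicts with `host`, `time`, and `event` keys; those field names, the 60-second window, and the sample events are illustrative assumptions, not any vendor's schema.

```python
from datetime import datetime, timedelta


def correlate(security_events, metric_events, window_seconds=60):
    """Pair each security event with metric events on the same host within the window."""
    window = timedelta(seconds=window_seconds)
    pairs = []
    for sec in security_events:
        for met in metric_events:
            same_host = sec["host"] == met["host"]
            close_in_time = abs(sec["time"] - met["time"]) <= window
            if same_host and close_in_time:
                pairs.append((sec["event"], met["event"]))
    return pairs


# Hypothetical exports from a security tool and a metrics tool.
security_events = [
    {"host": "web1", "time": datetime(2024, 1, 10, 12, 3, 17), "event": "root login"},
]
metric_events = [
    {"host": "web1", "time": datetime(2024, 1, 10, 12, 3, 50), "event": "cpu spike"},
    {"host": "web2", "time": datetime(2024, 1, 10, 12, 3, 20), "event": "cpu spike"},
    {"host": "web1", "time": datetime(2024, 1, 10, 13, 0, 0), "event": "disk full"},
]
pairs = correlate(security_events, metric_events)
```

Even this toy version has to agree on host names, clocks, and field names across tools, which is exactly the translation work that grows with every tool added.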
Administration (and cost) is heavier
Each additional tool means managing permissions through RBAC, customizing data feed sources, managing plug-ins, and maintaining supporting infrastructure. The resource and cost burden can become extremely heavy pretty quickly when designing for scale, high availability, and storage.
Every agent deployment, server component, data source, and tool configuration requires automation effort. Whether you use a desired state configuration tool like Puppet, an orchestrator like Ansible, or even custom scripts to configure your monitoring solution, each tool will require additional automation. Each automation will have its own tests, development, versioning, upgrades, and deployment lifecycle. This directly adds to the overhead of your engineering team.
Languages and APIs
Some tools output in JSON; others require transformation into usable formats through regex, grokking, or custom sed/awk style changes. Regardless, each tool has its own flavor of language and way of modeling data to be ingested by downstream components. This includes API calls, which can programmatically publish or pull data to and from the monitoring tools. In fact, having multiple tools that can’t share a data set sometimes requires API calls to pull data from one source to another.
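A small sketch of what that transformation can look like: one tool emits JSON, another emits plain text, and both are mapped onto a common record. The JSON field names (`ts`, `severity`, `msg`) and the plain-text layout are hypothetical, chosen only to show the shape of the work.

```python
import json
import re

# Hypothetical plain-text layout: "<ISO timestamp> <LEVEL> <message>"
TEXT_LINE = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def normalize(line):
    """Convert a JSON or plain-text log line into one common dict, or None if unparseable."""
    line = line.strip()
    if line.startswith("{"):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return None
        # Map the (assumed) JSON field names onto the common schema.
        return {
            "timestamp": record.get("ts"),
            "level": str(record.get("severity", "")).upper(),
            "message": record.get("msg"),
        }
    match = TEXT_LINE.match(line)
    return match.groupdict() if match else None


json_record = normalize('{"ts": "2024-01-10T12:00:00Z", "severity": "error", "msg": "db timeout"}')
text_record = normalize("2024-01-10T12:00:01Z WARN disk at 91%")
```

Both records end up with the same `timestamp`, `level`, and `message` keys, so downstream components only need to understand one shape. Every extra source format means another branch like these to write, test, and maintain.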
To Build or to Buy
We all love using open source software.
It’s free, there’s a community driving the project, and it allows us to avoid vendor lock-in and develop a set of skills we can take from one job to another. Building a monitoring system on top of open source is a choice many organizations make. There are some fantastic open source monitoring projects on the market.
But as a business grows, some monitoring tools begin to become a burden. Engineers end up spending more time maintaining these systems than working on the applications they are developing. When building open source monitoring systems, engineering teams should consider and address the following issues:
- Is the architecture designed with enough storage for growth and data retention, and does it account for clustering, performance tuning, failover strategy, and high availability?
- How difficult is upgrading? How difficult is it to scale up storage and performance with automation or without it?
- What alerts will be created at a minimum, and how will they send notifications?
- What metrics should be collected, how should they be collected, and for how long?
- What log analysis is required and can the tool perform the analysis out of the box, or is there required coding?
- What reporting is required? Can the tool support dashboards and reports that you need? Will reports require additional development efforts?
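The storage and retention questions above can be turned into a back-of-the-envelope calculation. The formula below is a deliberately rough assumption (raw daily volume, times retention, times the number of copies, plus a fractional overhead for indexes and metadata), and all the numbers are invented for illustration; real sizing depends heavily on the tool, compression, and data shape.

```python
def estimate_storage_gb(daily_ingest_gb, retention_days, replicas=1, overhead=0.1):
    """Rough storage estimate in GB.

    Assumes data is kept for the full retention period, stored once as a
    primary copy plus `replicas` extra copies, with a fractional `overhead`
    for indexes and metadata. This is a simplification, not a vendor formula.
    """
    copies = 1 + replicas
    return daily_ingest_gb * retention_days * copies * (1 + overhead)


# Hypothetical workload: 50 GB/day, 30 days of retention, one replica.
today = estimate_storage_gb(50, 30, replicas=1, overhead=0.1)

# The same workload if daily ingest doubles as the business grows.
after_growth = estimate_storage_gb(100, 30, replicas=1, overhead=0.1)
```

With these made-up inputs the estimate comes to roughly 3,300 GB today and about twice that after growth, which is the kind of number worth knowing before committing to a self-hosted design.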
Teams that are resource starved may be able to answer all of the above questions but lack the time to implement everything required for a fully mature monitoring solution. If your organization has the time and cycles to spend building an MVP or mature solution on open source tools, building is the right option.
Conversely, an engineering team may have less time to dedicate to building the solution and need more “out of the box” functionality. Specifically, the “designing architecture” considerations can be a large effort, especially for enterprises or customers dealing with a tremendous number of systems and a large volume of data. SaaS vendors address many infrastructure, scaling, storage, and performance concerns as part of their offering. An additional benefit of SaaS tools is that most vendors have cost models that grow with your monitoring needs. So instead of having to account for the scale and cost of tools you build yourself, this cost consideration is already built into the SaaS model.
There are a multitude of monitoring solutions on the market, both open source and proprietary. There are many reasons why an organization will choose multiple tools to complete its monitoring picture, but in doing so, the sprawl creates additional challenges to overcome, as detailed above. Weigh the pros and cons carefully before implementing your monitoring strategy.
Logz.io’s goal is to empower engineers to be more effective by providing them with the open source monitoring, troubleshooting and security tools they want to use, with the scalability, add-on features and availability required for monitoring modern IT environments. Offering a unified machine data analytics platform based on the ELK Stack and Grafana, Logz.io seeks to help solve the tool sprawl challenge.
More about this goal and our vision in a future article!