In the log management practices of the past, data and applications generally lived within centralized IT systems such as IBM mainframes or large, purpose-built servers.
To monitor application health and ensure that systems were running smoothly, administrators used legacy monitoring systems such as HP OpenView or CA Unicenter. If an event occurred, the monitoring system would alert the system administrator, who would then be required to log into the system, read the log file, analyze it, decide what had actually happened and where, and then take the necessary action to remediate the issue.
In this article, I’ll quickly review the evolution of log management in recent years, describe our own personal experiences analyzing our logs and metrics and share my six main considerations for log analytics in the cloud.
“IT organizations suddenly gain access to thousands of servers running more applications than ever that also wind up generating more log files than most IT organizations can possibly cope with…In order to be successful in the cloud, IT organizations need to relentlessly measure how both systems and applications are performing. The problem is that log collection and processing is painful.”
— Amazon.com CTO Werner Vogels at the AWS NY 2015 Summit
The Tiered Web Application Stack
Along with the accelerated adoption of web and cloud technologies over the past few years, the way that applications are architected has also started to change. Instead of having a few servers that monopolized application infrastructure stacks, companies now have dozens, hundreds, or even thousands of servers that each play a smaller role within the stack. Multiple web servers, database servers, and logic-tier servers are now required, with each web server serving a specific domain and sitting on a different virtual server.
This change was deemed to be a huge leap for IT in general and the agile realm in particular. It could also be considered a wake-up call for operations because managing such an environment now required aggregating all logs into a central repository. This brought about the first wave of log aggregation companies such as Splunk, HP ArcSight, and a few others that provided the ability to aggregate logs from a range of different data sources into a single location.
The Trio: Public Cloud, Containers and Microservices
When examining the state of affairs over the last few years, it is clear that advancements in modern IT have actually complicated matters. The introduction of the public cloud, Docker containers, and microservices—the trio of modern IT components—has made log management exponentially harder. Each process, no matter how small, has its own Docker container with its own log files, making the need for centralization and control in log analysis clearer than ever. The ability to gain insights from multiple sources of log files has become a must for all IT organizations, regardless of size.
Despite moving to the cloud and hearing advice from Vogels and others, many IT organizations are still not using log analytics effectively and are therefore unable to attain optimal agile development and delivery in a DevOps environment. Here are the six to-dos that we’ve learned (sometimes the hard way) about log management:
- Aggregate ALL Logs
It’s necessary to capture as much data as possible and then ship it to your log analytics system. That way, you’ll have a database of valuable information that you can access and analyze at any time. Due to the high granularity and complexity of modern systems, you might find that the root cause of an issue lies in logs that were never captured. Internally, at Logz.io, we failed to do this and learned that lesson the hard way. You can read about our experience here. You’re never going to regret sending logs, I promise you.
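As a minimal sketch of the shipping side, the snippet below wraps each raw log line in a JSON envelope with a timestamp and source tag before sending it on, so that even logs you don’t think you need today arrive indexed and searchable. The field names and the `ship` helper are illustrative, not any particular product’s API.

```python
import json
import socket
import time


def to_record(line, source):
    """Wrap a raw log line in a JSON envelope with metadata so the
    analytics backend can index it by source and arrival time."""
    return json.dumps({
        "timestamp": time.time(),
        "source": source,
        "message": line.rstrip("\n"),
    })


def ship(lines, source, host, port):
    """Send one JSON record per line over TCP (newline-delimited)."""
    with socket.create_connection((host, port)) as conn:
        for line in lines:
            conn.sendall((to_record(line, source) + "\n").encode("utf-8"))
```

Wrapping every line, including the ones that look uninteresting, is the point: the cheap metadata you attach at ship time is what makes the logs searchable later.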
- Keep Applications Safe
Log analysis should never damage the application it helps monitor. That’s a log management rule of thumb. Make sure that the resources on the application side are not being overwhelmed. There are a couple of ways to go about that. First, use a solid, stable agent to ship the logs — Rsyslog, NXLog, and Logstash Forwarder are a few good ones. Then, make sure they are well-configured, as there are parameters that control the frequency of log collection and distribution, which can impact server-resource utilization.
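For example, a hedged rsyslog configuration along these lines forwards an application log over TCP while capping the in-memory queue, so a logging burst buffers in the shipper rather than pressuring the application. The file paths, hostname, and sizes are illustrative only:

```
# /etc/rsyslog.d/50-ship-app-logs.conf -- illustrative values only

# Tail the application log file.
module(load="imfile")
input(type="imfile" File="/var/log/myapp/app.log" Tag="myapp:")

# Forward over TCP with a bounded in-memory queue so that bursts
# buffer in the shipper instead of blocking the application, and
# retry indefinitely if the central endpoint is down.
action(type="omfwd" Target="logs.example.com" Port="514" Protocol="tcp"
       queue.type="LinkedList" queue.size="10000"
       action.resumeRetryCount="-1")
```

The queue and retry parameters are exactly the kind of knobs mentioned above: they trade memory on the application host against log durability during outages.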
- System Scalability
As your organization or the demand for your application grows, so will the amount of log data. It is critical to make sure that your log analytics and management capabilities can scale accordingly. It’s fairly easy to get started with a small server to process logs, but you need to make sure it will be up and running when you need it most (such as whenever bursts of logs occur). When your system has a problem, it generates much more data than usual, which can break your log analytics pipeline at exactly the moment you need it.
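One common way to survive such bursts is to put a bounded buffer between the application and the shipper and drop the oldest records under pressure rather than blocking or crashing. The sketch below is a single-threaded illustration of that idea, not a production shipper:

```python
import queue


class BurstTolerantBuffer:
    """Bounded log buffer: under burst load, evict the oldest entries
    rather than blocking the application (illustrative sketch)."""

    def __init__(self, maxsize=1000):
        self.q = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # track evictions so the loss is observable

    def push(self, record):
        try:
            self.q.put_nowait(record)
        except queue.Full:
            self.q.get_nowait()  # evict the oldest record
            self.q.put_nowait(record)
            self.dropped += 1
```

Counting what you drop matters as much as dropping it: a `dropped` metric tells you the burst happened even though the records themselves are gone.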
- Integrate with DevOps Processes
So you’ve aggregated your logs and they are all in one place with a system implemented to serve your data. What’s next? One of the main benefits of applying log analytics to agile R&D organizations is the ability to integrate it fully into their release processes. Integrating with the SDLC (Software Development Life Cycle), CI (Continuous Integration), and CD (Continuous Delivery) are fundamental steps towards success in log analytics. Effective log analytics enables good agile development, testing, and deployment processes because developers and QA teams can easily see how their code operates in production, as well as which alerts it generates.
At Logz.io, our development teams have coined the term “LDD” – Log-Driven Development. Every feature that goes to production must have a dashboard and a set of alerts defined for it to make sure it’s completely transparent to the developers who implemented it and to the DevOps teams who support it. Otherwise, it can’t go to production. What’s nice about it is that developers assume full ownership of their work, removing the gap between the developer and the DevOps engineer. With ownership comes responsibility, accountability, and recognition.
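To make the idea concrete, an LDD-style alert might boil down to a query like the one below, which counts ERROR-level log lines tagged with a feature over a recent time window. The Elasticsearch-style field names (`level`, `feature`, `@timestamp`) and the fixed threshold are assumptions for illustration, not a prescribed schema:

```python
def error_count_query(feature, minutes=5):
    """Elasticsearch-style query body counting ERROR log lines for
    one feature over a recent window (field names illustrative)."""
    return {
        "size": 0,  # we only want the hit count, not the documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"level": "ERROR"}},
                    {"term": {"feature": feature}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
    }


def should_alert(error_count, threshold=10):
    """Fire the feature's alert once errors cross a fixed threshold."""
    return error_count >= threshold
```

Defining such a query and threshold per feature, before the feature ships, is what makes the "no dashboard, no production" rule enforceable.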
- Ask the Right Questions
Where most organizations fail is in not properly utilizing the log information that they’ve acquired. How many times have you chased an issue only to find out that it was in the logs all along, but you were just not looking in the right place?
Let’s say that you’re monitoring 40 or 50 different types of logs. What exactly should you be looking for? What questions should you be asking your data? The key to having a successful log analytics implementation is to have a large library of actionable insights that you can derive from your log files with the ELK Stack. At Logz.io, we are focusing on this a lot internally, and I hope we will be able to share more in the future.
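As a toy illustration of asking questions of your data, the snippet below surfaces the most frequent ERROR messages in a batch of log lines. A real library of insights over the ELK Stack would be query-driven rather than in-process, but the principle is the same; the log format here is made up:

```python
from collections import Counter


def top_errors(lines, n=3):
    """Return the n most common ERROR messages in a batch of log lines."""
    errors = [line.split("ERROR", 1)[1].strip()
              for line in lines if "ERROR" in line]
    return Counter(errors).most_common(n)


sample = [
    "2016-01-01 12:00:01 INFO service started",
    "2016-01-01 12:03:12 ERROR db timeout",
    "2016-01-01 12:03:15 ERROR db timeout",
    "2016-01-01 12:05:40 ERROR disk full",
]
print(top_errors(sample))  # → [('db timeout', 2), ('disk full', 1)]
```

“Which error message is suddenly the most common?” is exactly the kind of pre-canned question worth keeping in such a library: cheap to compute, and often the first clue in an incident.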
The 6th Tip: Use Open Source
Open-source solutions are excellent from a vendor lock-in perspective, and they allow companies to participate in the growing communities that surround open-source products. Learn from others’ experiences and use any relevant extensions that the community provides. Today, more than ever, you can leverage open-source packages to accelerate development—and building your log management solution is no different.