Gaining visibility into modern IT environments is a challenge that an increasing number of organizations struggle to overcome.
Yes, the advent of cloud computing and virtually "unlimited" storage has made it much easier to solve some of the traditional challenges involved in gaining visibility. However, architectures have evolved toward microservices, containers, and scheduling infrastructure. Software stacks and the hardware supporting them are becoming more complex, creating new and different challenges. These changes have directly increased the complexity of logging, and it takes a very specific set of tools and strategies to solve this visibility challenge.
Understanding the challenge
Log data is one of the cornerstone requirements for overcoming this challenge. Yet nearly every application, appliance, and tool today generates an ever-increasing stream of log messages containing a wealth of information on what happened, when, and why. These sources are often distributed on-premises, in the cloud, or across multiple clouds.
In today's world, old-school methodologies are no longer viable:
- Distributed systems, whether based on the traditional client/server model or on containers and cloud services, generate a huge amount of logs that are not only expensive to store but also expensive to query.
- Effective monitoring must be done in real time. If an application crashes, teams need to be alerted immediately so they can troubleshoot quickly.
- Logs are generated by devices, applications, and servers, each in different formats. If logs are simply shipped as flat files to sit on a file server, any kind of analysis takes significant effort.
- Most importantly: cloud-based architecture requires efficient logging, alerts, automation, analysis tools, proactive monitoring, and reporting. Old-school logging supports none of this.
Centralized logging concepts
Modern log management must perform aggregation, processing, storage, analysis, and alerting. These components must be designed on top of cloud principles: high availability (HA), scalability, resiliency, and automation.
- Aggregation – the ability to collect and ship logs from multiple data sources.
- Processing – the ability to transform log messages into meaningful data for easier analysis.
- Storage – the ability to store data for extended time periods to allow for monitoring, trend analysis, and security use cases.
- Analysis – the ability to dissect the data by querying it and creating visualizations and dashboards on top of it.
- Alerting – the ability to be notified in real time when an event takes place.
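To make the processing component concrete, here is a minimal Python sketch, assuming a common web-server access-log format. The regex and field names are illustrative, not a standard; real pipelines typically use grok patterns or a dedicated processing engine to do the same transformation at scale.

```python
import json
import re

# Hypothetical pattern for a typical web-server access-log line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_line(line):
    """Transform one raw log line into structured, queryable fields."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

raw = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
print(json.dumps(parse_line(raw), indent=2))
```

Once every line is reduced to named fields like this, the later storage, analysis, and alerting stages can index and query by field instead of scanning flat text.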
Cloud principles for a log management solution
Now that we've covered the core elements of a modern log management system, let's look at the cloud principles that must be considered when building a visibility solution for your environment, and the challenges each brings to maintaining that solution.
- High availability – building for high availability (HA) means eliminating single points of failure so that failures cause little or no disruption of service. In cloud environments such as AWS, this means deploying the log management solution across more than one availability zone (and possibly more than one region) and including a data replication mechanism using a service such as S3.
- Scalability – This is critical for most cloud services and the hardest to manage in log management solutions. Some components are bound by storage, indexing, or clustering constraints, which require careful management behind the scenes. Others, such as forwarders and log shippers, are easily scaled to meet demand.
- Upgrades – Upgrades are a routine activity in nearly every service. But when you are dealing with several interconnected visibility components involving clustering, storage, reindexing, and so on, the effort becomes a large project in itself.
- Resilience – This principle refers not only to the storage layer (uptime and loss prevention) but also to avoiding service disruption. Where this differs from HA is in the solution's ability to continue operating through updates, migrations, and zone failures, and to recover data in the case of deletion or corruption.
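The resilience principle can be illustrated with a small sketch. The class below is a hypothetical, in-memory simplification (production shippers such as Filebeat or Fluentd spool to disk) of the buffer-and-retry pattern that keeps events from being lost when a downstream endpoint is temporarily unavailable:

```python
import time

class BufferedShipper:
    """Sketch of a resilient shipper: buffer events locally and retry
    delivery with backoff, so an outage delays shipping but loses nothing."""

    def __init__(self, send, max_retries=3, backoff_seconds=1.0):
        self.send = send                  # callable that delivers one batch
        self.max_retries = max_retries
        self.backoff_seconds = backoff_seconds
        self.buffer = []                  # real shippers spool to disk instead

    def enqueue(self, event):
        self.buffer.append(event)

    def flush(self):
        """Attempt delivery; on repeated failure keep the buffer intact."""
        for attempt in range(self.max_retries):
            try:
                self.send(self.buffer)
                self.buffer = []
                return True
            except ConnectionError:
                time.sleep(self.backoff_seconds * (2 ** attempt))
        return False  # events stay buffered for the next flush

# Demo with a hypothetical flaky endpoint that fails twice, then succeeds.
attempts = []
def flaky_send(batch):
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("endpoint down")

shipper = BufferedShipper(flaky_send, max_retries=5, backoff_seconds=0)
shipper.enqueue({"level": "error", "msg": "disk full"})
print(shipper.flush())  # True: delivered on the third attempt
```

The same idea scales up: HA keeps the pipeline serving during a failure, while resilience ensures that whatever could not be delivered in the moment is still delivered, and still intact, afterwards.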
Summing it up
It should be apparent that managing a full-scale log management solution for modern software requires a considerable amount of planning. That doesn't mean it can't be done; it just means you need to plan carefully and make an informed decision.
To help you decide whether to implement your own do-it-yourself log analytics solution or opt for a SaaS provider such as Logz.io, here is a checklist to help you get the most out of your operations.
Your log management solution must:
- Support cloud-native log aggregation for containers, container orchestration, and cloud services.
- Support a processing engine that can transform logs as necessary (ETL or grokking).
- Support storage that is cloud-native or auto-replicated between nodes. Ideally, storage can be scaled automatically.
- Support configurable data retention, preferably with archiving to cheaper storage tiers.
- Support advanced analytics such as anomaly detection and machine learning.
- Support integration with notification channels such as email and ChatOps tools like Slack or Hipchat.
- Support custom metrics and dashboards and include pre-built dashboards and integrations with cloud services.
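To give a feel for the "advanced analytics" item, here is a toy anomaly detector in Python: a rolling z-score over per-minute log counts. It is a deliberately simplified stand-in for the ML-driven detection a real platform provides, and the window and threshold values are arbitrary assumptions:

```python
import statistics

def detect_anomalies(counts, window=10, threshold=3.0):
    """Flag indices whose z-score against the trailing window exceeds
    the threshold. A toy sketch, not production anomaly detection."""
    anomalies = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.stdev(baseline) or 1.0  # avoid division by zero
        z = (counts[i] - mean) / stdev
        if abs(z) > threshold:
            anomalies.append(i)
    return anomalies

# Per-minute error counts: steady baseline, then a sudden spike.
per_minute_errors = [4, 5, 4, 6, 5, 4, 5, 6, 5, 4, 42]
print(detect_anomalies(per_minute_errors))  # → [10], the spike
```

The point of the checklist item is that the platform runs this kind of detection continuously and feeds the results into the alerting channels above, rather than leaving you to eyeball dashboards.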