Centralized Log Management Best Practices and Tools


What is centralized log management? And why bother?

Centralized logging is a critical component of observability into modern infrastructure and applications. Without it, it can be difficult to diagnose problems and understand user journeys—leaving engineers blind to production incidents or interrupted customer experiences. 

Conversely, when the right engineers can access the right log data at the right time, they can quickly gain a better understanding of how their services are performing and troubleshoot problems faster.

This is a core reason why we bother with centralized logging. Considering the complexity, scale, and interconnectedness of modern cloud environments, engineers can troubleshoot faster when all their data is together in one log management tool. 

If ‘decentralized’ logging means many different individuals or teams accessing their logs in various ways, ‘centralized’ logging means bringing all of that data together in one place.

Centralized logging tools provide other benefits as well: a single place to manage log data security policies, manage user access, standardize parsing and other data processing, and help everyone on the team speak the same language.

In summary, centralized log management is the technology and process that unifies log data collection, processing, storage, and analysis—which accelerates mean time to resolution (MTTR) and simplifies logging administrative tasks.

Centralized log management pitfalls

Unfortunately, the act of centralizing logging into a single system can be an incredibly heavy lift due to high data volumes.

In a recent survey conducted by Forrester, 77% of respondents said that data quality was a top challenge in observability, and 68% cited large data volumes and their related costs as a top challenge.

All too often, engineers will turn on the fire hose of log data without realizing the consequences. When logs are sent without much thought about what information actually needs to be collected and who needs access to it, the result is noisy data—logs that nobody really needs but that are collected anyway.

As a result, collecting and processing log data can become overwhelmingly expensive—whether in bills from a logging vendor or in the cloud infrastructure needed to support your open source system. High-volume logging can require burdensome data infrastructure, which eats up engineering resources that would be better spent driving the core business.

Additionally, noisy log data can lead to hours of searching across millions of events—hindering troubleshooting of customer-impacting production incidents.

Modern cloud trends—growing compute footprints and systems split into ever more microservices—only point to more data and more logging challenges.

We expand on these challenges in a recent webinar about AWS log management at scale—in summary, log management pitfalls can create painful costs, slow troubleshooting, and security issues.

 

In this blog, we’ll explore strategies for dealing with the unique challenges of centralized log management in the cloud, as well as common solutions for overcoming these challenges.

Best practices for centralizing log management

To avoid the outcomes described in the previous section, let’s explore some best practices to ensure logging is as cost efficient, time efficient, secure, and flexible as possible:

Organize your log data across teams

Centralized logging is all about bringing your data together in one unified platform, so you can manage all the data in one place. However, giving every team member access to a single, giant bucket of log data can create new problems.

  • Role-based access control (RBAC): A key requirement for standards like PCI DSS, HIPAA, or SOC 2 is controlling user access to data, so that sensitive information in log data can’t be accessed by just anyone in the organization.
  • Troubleshooting: When an engineering organization centralizes all its logs in a single environment, engineers may need to search through billions of logs to find the relevant data, which could be a single log line. It’s much more efficient if each team can access only the data that’s actually relevant to them.

While organizing your data requires a bit more work upfront, you can reduce security and troubleshooting headaches down the road by organizing your logs into specific environments for each team. 

Look for a logging-as-a-service solution that allows admins to centralize the data in one place to simplify management, while also segregating user access to the data to enforce RBAC.

Focus on the logs that matter, while cutting costs and noise

Continuous data cleaning and optimization is a critical, yet often neglected best practice for cloud log management. Modern cloud environments generate huge quantities of log data from many different sources. The result can be enormous and painful costs.

Luckily, many engineering and IT orgs are indexing WAY more logs than needed—providing an opportunity for cost reduction. Log management is often criminally overpriced considering the amount of data that actually gets used. 

The solution? Continuously look through your log data and filter out the junk you don’t need. 

It sounds easy, but if it were easy, everyone would do it! Filtering out useless log data requires manually combing through your logs and adding filters to your log collection technology by editing its configuration.
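For example, here’s a minimal sketch of what that could look like with a Fluent Bit grep filter (the path, tag, and regex are hypothetical placeholders for your own environment):

    # Tail application logs, then drop health-check noise before it ships
    [INPUT]
        Name   tail
        Path   /var/log/myapp/*.log
        Tag    myapp.logs

    [FILTER]
        Name     grep
        Match    myapp.*
        # Drop any record whose 'log' field matches the regex
        Exclude  log  healthz|liveness

    [OUTPUT]
        Name   stdout
        Match  myapp.*

In a real pipeline the stdout output would be replaced with your actual backend; the point is that the filtering decision lives in the collector’s configuration.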

There are log management tools that are especially good at this, and others that don’t offer the capability at all (which conveniently aligns with a business model centered on data volumes), as we’ll discuss later.

But what if you don’t know exactly what log collection technology is shipping useless log data? How can you be sure it’s useless? What if you don’t have access to the log collection technology?

It’s easiest to identify and filter out unneeded data when you have a single UI to 1) inventory all of your incoming data so you can see what you don’t need, and 2) apply dynamic filters alongside that inventory to remove it.

Another way to reduce cloud logging costs is to use a hot-warm-cold storage architecture that keeps infrequently used logs in lower-cost storage. The tradeoff is usually slower query performance, but there are ways around this.
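As a rough illustration, on OpenSearch this kind of tiering is often handled with an Index State Management (ISM) policy like the sketch below (the state names and age thresholds are hypothetical, and it assumes the ISM plugin is available):

    PUT _plugins/_ism/policies/hot_warm_delete
    {
      "policy": {
        "description": "Keep recent indices hot, then move them to cheaper storage",
        "default_state": "hot",
        "states": [
          {
            "name": "hot",
            "actions": [],
            "transitions": [{ "state_name": "warm", "conditions": { "min_index_age": "7d" } }]
          },
          {
            "name": "warm",
            "actions": [{ "read_only": {} }],
            "transitions": [{ "state_name": "delete", "conditions": { "min_index_age": "30d" } }]
          },
          {
            "name": "delete",
            "actions": [{ "delete": {} }],
            "transitions": []
          }
        ]
      }
    }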

Try machine learning to accelerate log analytics

Every year Logz.io runs the DevOps Pulse Survey. In 2022, 64% of DevOps Pulse survey respondents reported that their MTTR during production incidents was over an hour, compared to only 47% in the previous year’s report. Huge log volumes generated by the cloud certainly don’t make troubleshooting faster.

In theory, troubleshooting cloud environments with log analytics should be a ripe use case for machine learning. Logging in the cloud requires sifting through millions, or even billions, of events in real time—a task that seems best suited for a machine.

Machine learning can help automatically surface the critical information, rather than requiring manual search. However, like many machine learning use cases with big promises, it’s fair to be skeptical. 

Many log management tools offer unsupervised machine learning that highlights “anomalies” in log data by comparing current volume trends against previous trends. This strategy is prone to false positives because volume volatility is in the very nature of logging in the cloud. 

But other applications of machine learning for log analysis can work well.

One example is log pattern recognition, which clusters similar logs into groups, making it easy to quickly scan the data and accelerate troubleshooting.

Another example is Logz.io’s Exceptions and Insights, which automatically surfaces critical exception and error logs that were viewed as urgent by other engineers and IT professionals.

Bonus best practice: If you haven’t parsed your log data, there is no hope for productive log analysis. Plain and simple. Parsing gives logs the structure they need to be searchable, but sadly, parsing is an unintuitive and annoying process.

There are plenty of resources to learn about log parsing. If you really don’t want to learn it (I can’t blame you), you can check out Logz.io’s parsing service.
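To give a flavor of what parsing involves, here’s a minimal sketch of a custom Fluent Bit parser; the log format, field names, and tag are hypothetical:

    # parsers.conf: turn a raw line like
    # "2024-05-01T12:00:00 ERROR payment failed" into structured fields
    [PARSER]
        Name        my_app
        Format      regex
        Regex       ^(?<time>[^ ]+) (?<level>[A-Z]+) (?<message>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S

    # fluent-bit.conf: apply the parser to the raw 'log' field
    [FILTER]
        Name     parser
        Match    myapp.*
        Key_Name log
        Parser   my_app

Once the level and message are separate fields, queries like “all ERROR logs from this service” become fast and reliable instead of full-text guesswork.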

Avoid vendor lock-in, which is especially painful for log management

Simply put, migrating across log management solutions is often time-consuming, tedious, and frustrating work.

This is especially true if you’re looking to transition from proprietary log collection to another logging vendor or log management tool, since that requires ripping and replacing the agent that collects the data.

Proprietary tools like Datadog, Splunk, and New Relic all have their own agents that, shockingly, only support log collection for their respective systems. If you ever need to switch systems (which happens all the time for reasons like cost, rotating leadership, and differences in feature functionality), ripping and replacing these agents can be painful.

In addition to log collection, your team will need to build new monitoring dashboards, log parsing rules, alerts, and other observability features. 

Sometimes, the cost to do all of this is so high that it’s not even worth transitioning to a different tool, even if your team would be better off for it. That’s why vendor lock-in for log management is so pervasive: switching vendors is expensive and time-consuming.

One way to beat vendor lock-in is with open source data collection. Technologies such as Fluentd and Fluent Bit are interoperable with many other log management back ends.

If you want to switch, all you have to do is change the configuration, rather than ripping and replacing everything. 
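For instance, with Fluent Bit the destination is just an [OUTPUT] section, so switching backends is roughly a matter of swapping that section; the hostnames and plugin choices below are illustrative:

    # Before: ship logs to a self-managed Elasticsearch/OpenSearch cluster
    [OUTPUT]
        Name   es
        Match  *
        Host   old-logging-backend.example.com
        Port   9200

    # After: point the same pipeline at a different backend over HTTP
    [OUTPUT]
        Name   http
        Match  *
        Host   new-logging-backend.example.com
        Port   443
        tls    On
        Format json

The inputs, parsers, and filters stay exactly as they were; only the destination changes.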

When it comes to dashboards, there are a few solutions out there that support OpenSearch Dashboards, like Amazon OpenSearch Service, Logz.io, and Aiven. But usually, you’ll need to build new dashboards as you switch log management solutions.

Unify log analytics with metric and trace data

Centralized logging is critical, but when it comes to full visibility into your cloud environment, it only shows part of the picture. Metric and trace data add context for a more holistic view of your infrastructure and microservices.

When all the data is together in a single platform, you can correlate the signals from each telemetry data type to troubleshoot production issues faster.

For example: if you’re monitoring your metrics and you identify a potential production issue, you’ll need to understand what’s causing it. Rather than going into your log management tool and starting the query from scratch, you can correlate your metric data with the associated logs (or traces). This can only be done by collecting all of your telemetry in the same place.

Similarly, you can investigate production issues quickly by correlating metric and log data with the associated trace data. 

Distributed tracing takes a little bit more work to set up because applications must be instrumented to send the trace data, but it can significantly accelerate troubleshooting by clearly isolating the performance bottleneck in complex microservice architectures.

Using point solutions for log, metric, and trace analytics not only prevents data correlation, it also increases tool sprawl in a world that desperately needs consolidation (as in, the world of cloud, DevOps, and ITOps).

Look for a unified observability platform that centralizes log, metric, and trace data in a single place.

Implementing these best practices requires careful consideration of log management tools. Different logging technologies have different strengths and weaknesses, which you’ll need to align with the best practices you prioritize.

Let’s review some of the most popular components for log management, which can be (over)simplified to log collection, processing, storage, and analysis technologies. 

Fluentd and Fluent Bit

The first step of implementing a log management stack is log collection, so we’ll start with some of the most popular log collection tools out there: Fluentd and Fluent Bit.

As mentioned earlier, Fluentd and Fluent Bit are most widely known as popular open source log collectors for Kubernetes environments, but they also collect data from many other sources.

These technologies act as a unified layer to aggregate data from multiple sources, standardize the differently formatted data into JSON objects, and route it to different output destinations.
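A minimal Fluentd sketch of that flow might look like the following (the paths and destination are placeholders, and the elasticsearch output assumes the fluent-plugin-elasticsearch plugin is installed):

    # Collect and tag application logs
    <source>
      @type tail
      path /var/log/myapp/*.log
      pos_file /var/log/fluentd/myapp.pos
      tag myapp.access
      <parse>
        @type json
      </parse>
    </source>

    # Route everything under 'myapp.' to a search backend
    <match myapp.**>
      @type elasticsearch
      host search-backend.example.com
      port 9200
      logstash_format true
    </match>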

Performance, scalability, and reliability are some of these technologies’ great design strengths. For example, a typical Fluentd deployment uses roughly 40MB of memory and can process more than 10,000 events per second.

To take it a step further, Fluent Bit leaves an even lighter computing footprint, running on only ~450KB of memory!

Fluent Bit is also capable of collecting metric data, making it an easy and unified option for those collecting both telemetry types. If you’re also using Prometheus, Fluent Bit can export metrics in Prometheus format to unify the data in the same data store and dashboards.
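As a rough sketch, assuming a recent Fluent Bit version with the built-in metrics plugins (and a Linux host for node_exporter_metrics), that could look like:

    # Collect host-level metrics (CPU, memory, disk, etc.)
    [INPUT]
        Name            node_exporter_metrics
        Tag             node_metrics
        Scrape_interval 30

    # Expose them on an endpoint Prometheus can scrape
    [OUTPUT]
        Name   prometheus_exporter
        Match  node_metrics
        Host   0.0.0.0
        Port   2021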

The ELK Stack and Filebeat

The ELK Stack is one of the most popular centralized log management solutions – used to collect, process, store and analyze log data. 

While ELK used to be open source, Elastic moved the software to a non-open source license in January 2021. As a result, AWS launched OpenSearch, as we’ll discuss in the next section.

The stack consists of Elasticsearch for log storage and search, Logstash for data processing, and Kibana for log analysis—however, Logstash isn’t often used in new deployments; it’s usually replaced by lighter-weight alternatives like Fluentd.

In addition to Fluentd, ELK users collect their data with the Beats family, which is a group of data collection agents (the most popular being Filebeat) that send data to the ELK Stack from many supported sources.
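A typical filebeat.yml is only a few lines; this sketch assumes an illustrative log path and Elasticsearch endpoint:

    filebeat.inputs:
      - type: filestream
        id: myapp-logs
        paths:
          - /var/log/myapp/*.log

    output.elasticsearch:
      hosts: ["https://elasticsearch.example.com:9200"]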

The ELK Stack is easy to implement on a small cluster, but as data volumes grow, it can become a considerable effort to manage. Users will need to handle upgrading, managing clusters, sharding indices, parsing logs, performance tuning, and other tasks themselves. 

If these tasks aren’t executed properly and in a timely manner, the result can be slow queries, dropped data, or even a crashed stack.

While Elastic has built a service to simplify ELK management, the customer is still on the hook to monitor and troubleshoot problems in the ELK data pipeline.

OpenSearch and OpenSearch Dashboards

Once Elastic closed-sourced the ELK Stack, AWS launched OpenSearch and OpenSearch Dashboards—forked versions of Elasticsearch and Kibana, respectively. With the help of AWS engineering firepower, these technologies have quickly grown into popular open source solutions for centralized log management.

Naturally, OpenSearch and OpenSearch Dashboards started off very similar to Elasticsearch and Kibana, but they have been slowly growing apart as the AWS-led community takes OpenSearch in a different direction.

Most notably, there are a variety of features that are only available with the premium (paid) subscription for Elasticsearch and Kibana, but are free with OpenSearch. 

These include:

  • Advanced security features like encryption, authentication, access control, and audit logging and compliance.
  • Machine learning-powered log analysis to identify trends and surface anomalies in the data, which can help accelerate troubleshooting.
  • Centralized user management/access control. 

OpenSearch is an excellent solution—but it’s not perfect. Running it yourself, especially at larger scales, requires similar maintenance tasks to ELK that can eat up engineering resources. Services like Logz.io provide OpenSearch-as-a-service, making OpenSearch logging a seamless experience.

Splunk

Splunk basically invented centralized log management. They were founded in 2003 and have been a logging pioneer ever since. They’re widely known for their advanced analytics, ML, and their pervasive adoption—there are job titles named after them at this point: Splunk engineers!

The most common drawback you’ll hear is the cost. Splunk is notoriously expensive—at least partially due to their limited data filtering capabilities. Much of the data you pay for may never be needed. 

Identifying and removing unneeded data requires manually searching through your logs to understand what you need, and then going back to reconfigure your Splunk Universal Forwarder (what they call their data collection agent) to filter out the data at the agent level.

Since this is not easy or intuitive, it often never gets done.
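For reference, the standard pattern is a props.conf/transforms.conf pair that routes matching events to nullQueue, typically applied on a heavy forwarder or indexer rather than on the Universal Forwarder itself; the sourcetype and regex below are hypothetical:

    # props.conf
    [my_app_sourcetype]
    TRANSFORMS-drop_noise = drop_healthchecks

    # transforms.conf
    [drop_healthchecks]
    REGEX = GET /healthz
    DEST_KEY = queue
    FORMAT = nullQueue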

Logz.io

Logz.io provides an open source observability platform by unifying tools like OpenSearch for logs, Prometheus for metrics, and OpenTelemetry and Jaeger for tracing. We’ve built all kinds of capabilities around these open source technologies to simplify observability.

This includes Logz.io’s log management tool, which makes it easy to implement the best practices described above:

  • Organize your data: Segregate data across different Logz.io sub accounts by team, and assign user access to the data accordingly to simplify RBAC.
  • Reduce data volumes and costs: With the Data Hub, get a single UI to inventory all your incoming data and filter out the unneeded information to reduce costs.
  • Machine learning to accelerate log analytics: Logz.io deploys ML technologies like Log Patterns and Insights to automatically highlight the critical log data in your data set.
  • Avoid vendor lock-in: Logz.io is based on the most popular open source monitoring technologies—making it easy to move to and from the platform.
  • Unify logs with metrics and traces: Get a single place to search and correlate all your telemetry data to accelerate MTTR.

Aligning log management tools with best practices

The best log management technologies for your team will depend on how you prioritize best practices for centralized log management. 

Is data security especially important? Then perhaps open source logging tools like OpenSearch aren’t the best option. Alternatively, you may prefer a proprietary solution that supports security features like SSO, RBAC, data masking, and PCI compliance.

Is tech stack flexibility especially important? If so, open source solutions are probably your best choice.

These choices deserve careful consideration, because the stakes can be high and have a big impact on your observability strategy. Hastily assembled log management tools and processes can create: 

  • Slow troubleshooting and MTTR: The signal-to-noise ratio in centralized logging can be a huge challenge. When the critical troubleshooting data is hidden in a sea of useless information, it can be harder to find the signal—this can be the difference between a minor production issue and a serious incident.
  • Perpetually high costs: Log data and costs go hand in hand, so why pay for data you don’t need? And why pay the same for data you rarely need as for data you use all the time? Effective data filtering and flexible storage options can better align the value of log data with its cost.
  • Security and compliance risks: Log data often contains sensitive information, so if any user can go into any tool to access the data, that could violate compliance or security requirements. 

With the right combination of technologies and processes for centralized log management, you can overcome these pitfalls and ensure widely adopted, cost efficient, and effective logging across your organization. 

As discussed, Logz.io makes it easy to implement these best practices. If you’re interested in seeing it for yourself, check out our free trial.

Get started for free

Completely free for 14 days, no strings attached.