Observability is a critical part of any digital transformation or cloud journey. Any enterprise that builds applications and delivers them to customers is on the hook to keep those applications running smoothly and to ensure seamless digital experiences.

To gain visibility into a system’s health and performance, there is no real alternative to observability. The stakes are high for getting observability right — poor digital experiences can damage reputations and prevent revenue generation.

Unfortunately, ensuring effective and efficient observability only gets harder with scale. As cloud workloads grow and produce huge data volumes, observability systems can become burdensome cost centers that eat up engineering resources. 

Sprawling cloud environments can also jeopardize security, engineer happiness, and the central mission of observability itself — improving systems through real-time data analytics. 

These are some common pitfalls of enterprise observability. This guide will describe best practices and strategies to avoid these outcomes while implementing cost-efficient observability that is enthusiastically adopted across engineering teams.

Observability Strategy Success Criteria

Before jumping into best practices, let’s first define what success looks like. Which outcomes would a successful enterprise observability strategy produce? 

  • More resilient services: This is why many of us bother with observability in the first place. Observability should empower developers to deliver more resilient services that ultimately provide better digital experiences for customers.
  • Happy engineers and widely adopted observability: Like any other developer tool set, a slow, clunky, and unintuitive observability stack will thwart adoption. Poorly adopted observability products can curb engineer morale and prolong production incidents that negatively impact customer experiences.
  • Time efficiency: Engineering resources are expensive, and there are many ways observability can waste them. We’ll explore how to overcome common engineering time-sucks below.
  • Cost efficiency: Observability is a brutal cost center for many enterprises. Systems that handle huge observability data volumes can incur costs in different ways, including vendor bills, infrastructure costs, and engineering hours.

Priorities will vary across enterprises, but we continuously hear about these desired outcomes at Logz.io. Let’s take a look at some of the challenges that often prevent or hinder these outcomes.

Top Enterprise Observability Challenges

Scaling observability across multiple teams, services, and clouds can create all kinds of challenges.

Open source observability deployments can easily become overwhelmed by growing data volumes — requiring ongoing time and effort just to keep them standing. Plus, they often lack critical security controls like encryption, SSO, and user access management.

To save engineering resources and secure their data, many enterprises adopt SaaS platforms to remove the burden of managing their own data infrastructure. But this requires every engineer to learn completely new interfaces, configurations, and best practices. Proprietary observability tools like Datadog and New Relic are also infamous for high costs and vendor lock-in.

Many things can go wrong in the world of observability, so we asked a few hundred engineers (with help from Forrester) about the top challenges of implementing observability:

  • Data quality: 77% say poor data quality is at least somewhat challenging. Unparsed log data, for example, is very difficult to interpret — making production events much harder to investigate. Metric data can also be unknowingly manipulated by visualization tools — presenting an inaccurate picture of the current state of your services.
  • Number of tools: 71% say large numbers of tools are at least somewhat challenging. Investigating issues is longer and more complex when critical monitoring and troubleshooting data are collected in different places.
  • User access visibility and control: 70% say visibility / access control across teams is at least somewhat challenging. It can be near impossible for observability administrators to understand who can access which data when there are many users and tool sets — which can lead to security and compliance liabilities.
  • Data volume: 68% say large data volumes and cost escalation are at least somewhat challenging. Observability data volumes are exploding for many organizations — raising costs and obscuring the essential data within enterprise observability environments.

The consequences of these challenges can impact the adoption, efficiency, and security of enterprise observability. When asked about the impact of the above observability challenges, the same Forrester respondents answered:

  • MTTR impact: 37% said “lack of ability to identify incidents before they impact customers” was a key negative impact of observability challenges. High data volumes and poor data quality can make it difficult to draw valuable insights.
  • Value: 39% said “high costs for unimpressive results/basic functionality.” Large data volumes quickly drive up costs for observability tools, which may not be equipped to handle the load.
  • Manual work: 37% said “significant amounts of manual effort is required to draw useful insights.” Similar to the MTTR impact, it’s hard to find useful results in proliferating and poor quality data.
  • Time: 37% said “less time to spend on strategic/value-add activities” due to the time spent managing observability tools. Self-hosted tools in particular, like open source technologies, can eat up valuable engineering resources needed for other initiatives.

These are mission-critical problems for the modern enterprise. MTTR determines how quickly customer-facing issues are resolved, while manual engineering work distracts engineers from delivering differentiating value for their core business.

Let’s take a look at overcoming these common pitfalls.

Best practices and strategies for enterprise observability

Below are strategies and best practices that can prevent typical enterprise observability challenges, while realizing ideal observability outcomes.

Unified data analytics to reduce MTTR

Centralizing monitoring results, sharing data, standardizing configurations, and providing a consistent platform are all key reasons to standardize observability tool sets across teams.

Additionally, unified data analytics allows users to correlate across datasets to quickly pinpoint the root cause of a problem.

For example, if an engineer is alerted on suspicious metrics — such as spiking resource saturation in their infrastructure — they would need to quickly find the relevant log or trace data that could explain the cause of the problem. If they’re using separate tool sets for each data type, they would need to open the relevant interface and begin searching for the data from scratch.

If the data is in one place, they can quickly correlate the spiking metrics with the associated logs and/or traces. This immediately surfaces the logs or traces from the same time period, generated by the same service as the suspicious metrics.

Now they have the relevant information they need to isolate the root cause of the problem.
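
To make this concrete, below is a minimal sketch in Python of that correlation step, assuming logs live in an Elasticsearch-compatible store reachable over HTTP. The endpoint, index pattern, field names, and the "checkout" service are hypothetical placeholders for illustration, not any specific vendor’s API.

    import requests
    from datetime import datetime, timedelta, timezone

    # Hypothetical alert: resource saturation spiked on the "checkout" service at this time.
    ALERT_TIME = datetime(2024, 1, 15, 14, 32, tzinfo=timezone.utc)
    LOG_SEARCH_URL = "https://logs.example.com/logs-*/_search"  # placeholder endpoint

    # Pull error/warning logs from the same service in a +/- 5 minute window around the spike.
    window = timedelta(minutes=5)
    query = {
        "size": 100,
        "sort": [{"@timestamp": "desc"}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service.name": "checkout"}},
                    {"terms": {"level": ["error", "warn"]}},
                    {"range": {"@timestamp": {
                        "gte": (ALERT_TIME - window).isoformat(),
                        "lte": (ALERT_TIME + window).isoformat(),
                    }}},
                ]
            }
        },
    }

    resp = requests.post(LOG_SEARCH_URL, json=query, timeout=10)
    resp.raise_for_status()
    for hit in resp.json()["hits"]["hits"]:
        log = hit["_source"]
        print(log["@timestamp"], log.get("level"), log.get("message"))

When logs, metrics, and traces share a backend and common fields like service name and timestamp, this lookup is a single query rather than a context switch into a separate tool.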

With Logz.io, engineers can quickly correlate across their telemetry data to explore production issues in context and reduce MTTR.

Segregate data across teams to manage user permissions

This one might seem to contradict the previous section — didn’t we just discuss the advantages of unifying data in one place!?

Centralizing logs, metrics, and traces can clearly reduce MTTR. That said, new challenges arise when all of the data is together:

  1. Security and compliance liabilities: Oftentimes, telemetry data contains sensitive customer information. Enterprises in regulated industries cannot allow just anybody in the company to access this data — doing so risks violating compliance requirements like SOC 2 and PCI DSS. Observability admins need to control who can access specific data.
  2. Cluttered observability environments: When hundreds of users analyze their data in a single observability workspace, it fills up with noisy and irrelevant information. Every user would need to search through large amounts of data, dashboards, and alerts to find the information relevant to them.

This poses a bit of a paradox. Unifying data and users on a single platform clearly simplifies observability processes, improves MTTR, and makes it easier to share data across users. But it also creates new challenges described above — which just adds complexity.

Enterprise observability teams need to find technologies to centralize their data on a single tool set, while segregating data across teams to prevent the outcomes described above.

Segregating data can be done with self-hosted tools like the ELK Stack by creating separate clusters for separate teams, but this adds the overhead of managing multiple clusters. It can also be done with SaaS platforms like Datadog by provisioning separate instances for different teams.

However, in both of these scenarios, the observability administrator lacks top-down management to control user access (and data volumes and costs) across teams in a single place.

With Logz.io, observability admins can unify all their data and users in one platform under a Primary Account, and then assign users to specific data that live in segregated Sub Accounts. This provides a single place to manage costs and user permissions, along with the simplicity and MTTR benefits of unified telemetry data analytics.
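
As a rough illustration of how segregation can work at the shipping layer, the sketch below routes each team’s logs to its own account using a per-team token. The listener URL, tokens, and team names are placeholders for illustration, not actual Logz.io configuration.

    import json
    import requests

    # Hypothetical per-team shipping tokens, each mapping to a segregated sub account.
    TEAM_TOKENS = {
        "payments": "PAYMENTS-SUB-ACCOUNT-TOKEN",
        "search": "SEARCH-SUB-ACCOUNT-TOKEN",
    }
    LISTENER_URL = "https://listener.example.com:8071"  # placeholder log listener

    def ship_log(team: str, record: dict) -> None:
        """Send one log record to the account owned by `team`."""
        requests.post(
            f"{LISTENER_URL}/?token={TEAM_TOKENS[team]}",
            data=json.dumps(record),
            headers={"Content-Type": "application/json"},
            timeout=5,
        )

    # Each team's data lands in its own sub account, so admins can grant engineers
    # access only to the data their team owns while still managing everything centrally.
    ship_log("payments", {"level": "error", "message": "card authorization timed out"})
    ship_log("search", {"level": "info", "message": "reindex job completed"})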

Reduce costs with data and storage optimization

As telemetry data volumes explode, the cost of ingesting, processing, and retaining so much information can become overwhelming. For many enterprise IT organizations, observability is the second-highest budget item, behind only cloud costs.

Many of these costs are unnecessary, because a large share of telemetry data is junk. So why aren’t teams removing this junk data to reduce costs? Because it’s not easy to do.

Removing junk telemetry data generally requires two steps:

  1. Identify the unneeded data. To do this, engineers usually need to manually comb through their data and speak with their peers to determine what is needed and what isn’t.
  2. Filter out the data. This usually means reconfiguring each data collection component — such as Fluentd, a New Relic agent, Prometheus, or OpenTelemetry. This can be a pain if there are hundreds of components, especially if it isn’t obvious which one is collecting the specific data you want to filter out (see the sketch below).
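
For illustration, here is what step 2 can look like when logs pass through a small custom shipper before being sent on. The drop rules and field names are hypothetical; real collectors such as Fluentd or the OpenTelemetry Collector each have their own filter syntax, which is exactly why this gets tedious at scale.

    # Illustrative drop rules for noisy, low-value log lines.
    DROP_LEVELS = {"DEBUG", "TRACE"}
    DROP_PATHS = {"/healthz", "/metrics"}  # health checks and scrape endpoints (hypothetical)

    def keep(record: dict) -> bool:
        """Return True if the record is worth shipping (and paying for)."""
        if record.get("level", "").upper() in DROP_LEVELS:
            return False
        if record.get("http_path") in DROP_PATHS:
            return False
        return True

    incoming = [
        {"level": "DEBUG", "message": "cache hit", "http_path": "/api/cart"},
        {"level": "INFO", "message": "health probe ok", "http_path": "/healthz"},
        {"level": "ERROR", "message": "payment declined", "http_path": "/api/checkout"},
    ]

    to_ship = [r for r in incoming if keep(r)]  # only the ERROR record survives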

These steps aren’t easy or fast, so they often don’t get done. The result is perpetually high observability costs. Check out this podcast to learn why today’s environments produce so much data, and what you can do about it.

That’s why Logz.io built features like Data Optimization Hub — which provides a single UI to inventory all incoming data, highlight which data isn’t being used, and provide filters to remove the unneeded data. 

As a result, Logz.io users usually remove 30-50% of their total data volumes, which they were paying for before Logz.io.

Logz.io also helps reduce storage costs for the data you do want to keep — including LogMetrics, which converts heavy log data into time-series metrics, and Archive & Restore, which removes the need to index logs kept for compliance.
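
The idea behind converting logs to metrics is simple to sketch. The snippet below is an illustration of the general technique (not the LogMetrics implementation): it rolls individual error logs up into a per-service, per-minute counter, which is far cheaper to store than indexing every line.

    from collections import Counter
    from datetime import datetime

    def minute_bucket(timestamp: str) -> str:
        """Truncate an ISO-8601 timestamp to its minute, e.g. '2024-01-15T14:32'."""
        return datetime.fromisoformat(timestamp).strftime("%Y-%m-%dT%H:%M")

    # Hypothetical error logs; in practice this would be millions of lines per day.
    error_logs = [
        {"@timestamp": "2024-01-15T14:32:07", "service": "checkout"},
        {"@timestamp": "2024-01-15T14:32:41", "service": "checkout"},
        {"@timestamp": "2024-01-15T14:33:02", "service": "search"},
    ]

    # Collapse heavy log lines into a compact time series: (service, minute) -> error count.
    error_counts = Counter(
        (log["service"], minute_bucket(log["@timestamp"])) for log in error_logs
    )

    for (service, minute), count in sorted(error_counts.items()):
        print(f'errors_total{{service="{service}"}} {count} @ {minute}')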

Implement data quality processes to derive actionable insights from your data

Observability data can be noisy, overwhelming, and confusing. Some data provides helpful and actionable insights, while other data is useless. 

Observability practitioners can improve the overall quality of their data by filtering out the useless information, while enriching the data they keep. They can also add data processing components in their pipeline to transform confusing information into actionable insights.

Encouraging good observability data hygiene is an easy thing to suggest. But if it were easy to do, everyone would do it. There is a reason ensuring high-quality data is a pervasive problem in the monitoring and observability world: it’s unintuitive!

Below are practices that improve data quality, but are difficult to implement:

  • Log parsing is a key practice for improving log data quality: it structures logs into fields that are easy to search and visualize, and unparsed logs are essentially useless. Parsing log data requires complicated parsing languages like grok, which can take hours of debugging to get right (see the sketch after this list).
  • Data filtering can remove unneeded information that clutters essential insights. However, as described in the previous section, filters usually have to be configured for every data collection component in production, which can mean reconfiguring hundreds of components.
  • Machine learning can highlight critical information that could otherwise get lost in a sea of data. But this can require implementing solutions that add complexity, like adding a new cluster that runs machine learning jobs in an Elastic deployment. 
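
To show what parsing actually buys you, here is a simplified sketch that structures a raw nginx access log line into fields. It uses Python’s re module with named groups, which is roughly what grok patterns like %{IPORHOST:clientip} compile down to; the log line itself is made up for the example.

    import re

    # A raw, unparsed nginx access log line: hard to search or visualize as-is.
    raw = '203.0.113.7 - - [15/Jan/2024:14:32:07 +0000] "GET /api/cart HTTP/1.1" 500 1534'

    # Named regex groups play the role of grok field captures.
    NGINX_ACCESS = re.compile(
        r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
    )

    match = NGINX_ACCESS.match(raw)
    if match:
        fields = match.groupdict()
        # The line is now structured fields you can filter and aggregate on,
        # e.g. status:500 AND path:/api/cart.
        print(fields["status"], fields["path"], fields["clientip"])

Writing and debugging patterns like this for every log format in production is exactly the work that built-in and as-a-service parsing options are meant to remove.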

Data quality standards can be challenging to enforce in an enterprise observability strategy, but the alternative is paying for useless data that engineers can’t use to understand the current state of their system.

For those less experienced with best practices for telemetry data enrichment, there are technologies and expertise available to simplify things. 

For example: Logz.io provides three options to parse your log data depending on your experience and desire to customize your own parsing:

  1. Automatic parsing for popular log types (like nginx, AWS services, and Kafka) — parsing rules for these technologies are built into Logz.io.
  2. Parsing-as-a-service only requires a message through the support chat to get your logs parsed for you. 
  3. Those who want to customize parsing rules themselves can use Logz.io’s self-service log parser.

Logz.io also provides a single place to inventory and filter out data, making it easy to identify and remove data you don’t need. Rather than reconfiguring every data collection component, Logz.io users can filter their data within a single, simple UI.

Finally, Logz.io provides machine learning that can enrich data and highlight the most critical insights. The Exceptions feature automatically flags critical exceptions and surfaces threats to help users separate the most important information from the noise.

SaaS vs self-hosted

The decision to use a SaaS platform or run your own observability data infrastructure largely comes down to the resources available on your team.

Thousands of organizations run self-hosted open source observability tools (like OpenSearch, Prometheus, and OpenTelemetry) to monitor and troubleshoot their systems. Open source solutions are a great way to avoid vendor lock-in, they integrate easily into cloud native environments, and they benefit from continuous innovation by the engineering community. Learn about the leading open source observability tool sets here.

Alternatively, thousands of organizations also choose SaaS solutions so they don’t have to dedicate engineering resources to managing the observability data infrastructure themselves. 

Engineering talent is a valuable resource — SaaS observability allows engineering leaders to focus their top talent on their core business, rather than maintaining their observability data infrastructure.

Examples of observability data infrastructure maintenance include, but are not limited to:

  • Installing databases, ingestion, data collection, and other components on your own clusters
  • Adding and maintaining data queuing like Kafka if large data volumes start to overwhelm your clusters (see the sketch after this list)
  • Upgrading these components periodically to patch vulnerabilities and leverage new features
  • Troubleshooting the entire pipeline to fix performance issues that can delay querying – or worse, crash your data infrastructure altogether
  • Sharding indexes to prevent overwhelmed databases from delaying queries
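
As one example of what that maintenance looks like in practice, the sketch below buffers log records through Kafka before they hit the indexing tier, assuming the kafka-python client; the broker address and topic are placeholders. Every piece of this (brokers, topics, serializers, consumers) becomes infrastructure your team owns and patches.

    import json
    from kafka import KafkaProducer  # kafka-python client (assumed)

    # Placeholder broker and topic: in a self-hosted pipeline, you run these brokers yourself.
    producer = KafkaProducer(
        bootstrap_servers="kafka.internal.example.com:9092",
        value_serializer=lambda record: json.dumps(record).encode("utf-8"),
        linger_ms=100,  # batch records briefly to smooth out ingestion bursts
    )

    def buffer_log(record: dict) -> None:
        """Queue a log record so ingestion spikes don't overwhelm the indexers downstream."""
        producer.send("observability-logs", value=record)

    buffer_log({"level": "error", "service": "checkout", "message": "card authorization timed out"})
    producer.flush()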

If you want to allocate engineering resources to other priorities, just let somebody else manage the observability stack for you. 

Companies like Logz.io, Datadog, or Sumo Logic provide excellent SaaS platforms that handle the entire data pipeline and clusters for you. All you need to do is send your data, log into your account, and begin analyzing.

Avoid vendor lock-in, which is especially painful for enterprise observability teams

The decision to choose a SaaS or self-hosted solution requires an unfortunate trade-off between vendor lock-in and time efficiency — both of which can throw a wrench in enterprise observability strategy.

While choosing open source observability tools prevents vendor lock-in, it requires expensive engineering resources to maintain. On the other hand, SaaS solutions can eliminate maintenance requirements, but they can lock enterprises into contracts that don’t suit their long-term goals or cost requirements.

Proprietary observability is especially prone to vendor lock-in because of the investment required to implement it or migrate away from it. Setting up integrations, dashboards, and alerts — while teaching teams how to use them — can take months.

To avoid this trade-off between ease-of-use and vendor lock-in, Logz.io delivers the most popular and interoperable open source observability tool sets in the world on a fully-managed SaaS platform. 

Migrating users to and from Logz.io is easy — users can keep the open source integrations and dashboards they already have in place. And since it’s all delivered via SaaS, Logz.io eliminates the maintenance requirements needed to keep the data infrastructure up and running.

Equip every engineer to overcome technical roadblocks

In addition to managing the observability tool set, enterprise observability teams often need to support end users. This can require direct assistance for building dashboards, instrumenting services, parsing logs, building alerts, and navigating the interface.

Effective support ensures end users can quickly and effectively operate the observability system to identify and troubleshoot problems. Without it, simple problems like a misconfigured dashboard can prevent engineers from obtaining the critical insights they need to gain visibility into their services. 

A lot can go wrong with an observability system, so if it’s used by a hundred engineers, support can become a tedious full-time job.

This is why Logz.io has invested so heavily in our Support Team. Rather than leaving your team solely responsible for the successful adoption of your observability system, Logz.io provides direct assistance at no extra cost. Any Logz.io user can reach out through the Support Chat within the app to speak with a Customer Support Engineer and quickly overcome technical obstacles.

Our engineers can assist with any request and our average response time is 40 seconds!

How do enterprises assign observability ownership?

The best practices described above are great in theory, but who is going to implement all of them?

Observability responsibilities — which might include user access control, other security policies, data infrastructure performance, user support, and other tasks depending on the organization — are increasingly being rolled up to a dedicated team. 

This insight is the result of continuous developer and DevOps community surveys that Logz.io runs to better understand enterprise observability trends, challenges, and strategies — including the DevOps Pulse Survey and the aforementioned Forrester Research survey. Together, we received responses from 1,300+ observability practitioners.

We also gather qualitative feedback. Logz.io’s Principal Developer Advocate Dotan Horovits spoke with Slack’s Technology Lead Suman Karumuri on who owns observability. The recording can be found below.

 

Let’s call this model — where one team manages observability tool sets and processes for the rest of the organization — the Observability Shared Services Model.

Observability Shared Services Teams go by different names. According to the DevOps Pulse Survey, most organizations have tasked DevOps teams with observability administration (34%), while others spread the work across developers (28%), IT Operations (20%), and SREs (17%).

Regardless of their titles, a vast majority of respondents — 85% — said their organizations operate using an Observability Shared Services Model.

According to the same survey, these were the most-cited benefits to adopting a Shared Services Model:

  • One platform: 47% said providing a consistent platform across individual teams. Unifying data collection, analysis, data parsing, and other observability processes reduces the number of tool sets Shared Services Teams need to support.
  • Data sharing: 45% said sharing relevant data across multiple teams. When everyone is looking at the same types of data and formats, it’s easier to share information and use a common language.
  • Centralized monitoring results: 44% cited the ability to roll up monitoring results centrally. When all the data is together, practitioners can quickly correlate across different signals to get the full picture of production faster, and troubleshoot problems sooner.
  • Cost efficiency: 37% cited cost efficiency as a benefit of consolidated observability ownership. It’s easier to identify opportunities to reduce costs when all the costs are measured together. Plus, centralizing observability under one team can streamline inefficient processes created by incompatible tool sets. 
  • Security and compliance: 24% said the ability to enforce guardrails for compliance — it’s much easier to enforce compliance requirements with a single set of controls to monitor things like user access to data. 

Shared Services teams can implement tool sets and best practices to make observability more efficient, effective, and secure across an organization. For many enterprises, these are the teams on the hook to implement best practices to make observability work for hundreds of engineers.

Implement a strategy to make observability effective, efficient, and widely adopted

Enterprise observability systems need to reliably handle huge volumes of data every day for near real-time analysis. This can be a costly, complex, and burdensome task, but it’s essential to get it right. 

An efficient and performant observability system can be the difference between a minor issue that is quickly resolved, and a major production event with widespread customer impact. For enterprises that rely on their customers’ digital experience for revenue, observability directly impacts the bottom line.

As discussed, much can go wrong in an observability system — delaying troubleshooting, driving up costs, wasting engineering resources, and posing security risks. These are critical challenges for the modern enterprise.

The best practices and strategies in this article can help overcome these risks, and Logz.io can make them easier to implement.

If you’re interested in trying it out yourself, sign up for our free trial.
To get in touch with an expert about putting these best practices in place at large scales, a product demo may be more helpful.

Get started for free

Completely free for 14 days, no strings attached.