Key Considerations for a Cross-Organizational Observability Strategy


Logz.io ran two surveys earlier this year to better understand current trends, challenges, and strategies for implementing more effective and efficient observability – including the DevOps Pulse Survey and a survey we ran with Forrester Research. Together, we received responses from 1300+ DevOps and IT Ops practitioners on observability challenges, opportunities, and ownership strategies.

Additionally, Logz.io’s Principal Developer Advocate Dotan Horovits spoke with Slack’s Technology Lead Suman Karumuri on who owns observability in an enterprise. The recording can be found below.

One of the most interesting trends identified in these sessions and quantitative analyses is the rise of the Observability Shared Services Model. This model entrusts observability responsibilities to a single team, which manages observability tool sets and processes for the rest of the organization – including SREs, DevOps, IT Ops, and/or developers – hence the name of this article, ‘Cross-organizational observability’.

These Observability Shared Services Teams go by different names. According to the DevOps Pulse Survey, the largest share of organizations have tasked DevOps teams with observability management (34%), followed by developers (28%), IT Operations (20%), and SREs (17%).

Regardless of title, a vast majority of respondents – 85% – said their organizations operate using an Observability Shared Services Model.

According to the same survey, these were the most-cited benefits to moving to a Shared Services Model:

  • One platform: 47% said providing a consistent platform across individual teams. Unifying data collection, analysis, data parsing, and other observability processes reduces the number of tool sets Shared Services Teams need to support.
  • Data Sharing: 45% said sharing relevant data across multiple teams. When everyone is looking at the same types of data and formats, it’s easier to share information and use a common language.
  • Centralized monitoring results: 44% cited the ability to roll up monitoring results centrally. When all the data is together, practitioners can quickly correlate across different signals to get the full picture of production faster, and troubleshoot problems sooner.
  • Cost efficiency: 37% cited cost efficiency across the organization. It’s easier to identify opportunities to reduce costs when all the costs are measured together. Plus, centralizing observability under one team can streamline inefficient processes created by incompatible tool sets. 
  • Security and compliance: 24% said the ability to enforce guardrails for compliance – it’s much easier to enforce compliance requirements with a single set of controls to monitor things like user access to data. 

Shared Services teams can implement tool sets and best practices to make observability more efficient, effective, and secure across an organization. 

That being said, organizing and implementing observability for many separate teams with unique requirements can present new challenges. 

Key Challenges of Managing Observability Tool Sets and Processes for Multiple Teams

As cloud workloads grow larger and more complex, managing observability stacks can become increasingly expensive, burdensome, and potentially insecure. When asked about the top challenges of implementing observability, respondents from the Forrester survey replied:

  • Data quality: 77% say poor data quality is at least somewhat challenging. Unparsed log data, for example, is very difficult to interpret – making production events much harder to investigate.
  • Number of tools: 71% say large numbers of tools are at least somewhat challenging. Investigating issues takes longer and is more complex when critical monitoring and troubleshooting data are collected in different places.
  • User access visibility and control: 70% say visibility / access control across teams is at least somewhat challenging. It can be near impossible for observability administrators to understand who can access which data when there are many users and tool sets – which can lead to security and compliance liabilities.
  • Data volume: 68% say large data volumes and cost escalation are at least somewhat challenging. Observability data volumes are exploding for many organizations – raising costs and obscuring the essential data within observability environments.

According to the same Forrester respondents, the consequences of these challenges can impact the effectiveness, efficiency, and security of cross-organizational observability, including:

  • MTTR Impact: 37% said “Lack of ability to identify incidents before they impact customers”. High data volumes and poor data quality can make it difficult to draw valuable insights from the data.
  • Value: 39% said “High costs for unimpressive results/basic functionality”. Large data volumes quickly drive up costs for observability tools, which may not be equipped to handle the load.
  • Manual Work: 37% said “Significant amounts of manual effort is required to draw useful insights”. Similar to the MTTR impact, it’s hard to find useful results in proliferating and poor quality data.
  • Time: 37% said “less time to spend on strategic/value-add activities”. Managing observability tools, especially self-hosted tools like open source technologies, can eat up valuable engineering resources needed for other initiatives.

How Shared Services Teams can implement more effective and efficient cross-organizational observability

Below are strategies and best practices Shared Services teams can implement to overcome the challenges described above.

Unified data in one place, but segregated across teams

Centralizing monitoring results, sharing data, and providing a consistent platform across teams are all key reasons why Observability Shared Services teams are standardizing observability tool sets across teams. It’s much easier to manage a single tool set than many.

That said, unifying all your telemetry data in one place can create new kinds of challenges, including:

  1. Security and compliance liabilities: Telemetry data often contains sensitive customer information. Organizations in regulated industries cannot allow just anybody in their company to access this data, which would violate compliance requirements like SOC 2 and PCI. Observability admins need to control who can access specific data.
  2. Cluttered observability environments: When tens or hundreds of users analyze huge data volumes in a single observability workspace, the environment becomes noisy and messy. Every user needs to search through a large amount of irrelevant data, dashboards, and alerts to find what they’re looking for.

This poses a bit of a paradox. Unifying data and users on a single platform clearly simplifies observability processes, improves MTTR, and makes it easier to share data across users. But it also creates the new challenges described above, which add complexity of their own.

Observability Shared Services leaders need to find technologies and strategies to centralize their data on a single tool set, while segregating data across teams to prevent the outcomes described above.

This can be done with self-hosted tools like the ELK Stack by simply creating multiple clusters for separate teams, but this requires extra overhead to manage multiple clusters. This can also be done with SaaS platforms like Datadog by provisioning separate instances for different teams, but it lacks top-down management to help admins control costs and user access. 

With Logz.io, observability admins can unify all their data and users in one platform under a Primary Account, and then assign users to specific data that live in Sub Accounts. This provides a single place to manage costs and user permissions.
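The general pattern of one central place for cost and access management, with data segregated per team, can be sketched in a few lines of Python. This is a hypothetical in-memory model for illustration only, not the Logz.io API: the class names, the `daily_quota_gb` field, and the method names are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SubAccount:
    """Hypothetical per-team data silo with its own volume quota."""
    name: str
    daily_quota_gb: float
    members: Set[str] = field(default_factory=set)


@dataclass
class PrimaryAccount:
    """Hypothetical top-level account that owns every sub account."""
    sub_accounts: Dict[str, SubAccount] = field(default_factory=dict)

    def add_sub_account(self, name: str, daily_quota_gb: float) -> None:
        self.sub_accounts[name] = SubAccount(name, daily_quota_gb)

    def grant(self, user: str, sub_account: str) -> None:
        # Access is granted explicitly, one sub account at a time.
        self.sub_accounts[sub_account].members.add(user)

    def can_read(self, user: str, sub_account: str) -> bool:
        account = self.sub_accounts.get(sub_account)
        return account is not None and user in account.members

    def visible_to(self, user: str) -> List[str]:
        # A user only sees the data sets they were assigned to.
        return sorted(n for n, a in self.sub_accounts.items() if user in a.members)

    def total_quota_gb(self) -> float:
        # Admins can roll up volume (and therefore cost) in one place.
        return sum(a.daily_quota_gb for a in self.sub_accounts.values())


org = PrimaryAccount()
org.add_sub_account("payments", daily_quota_gb=50)
org.add_sub_account("web", daily_quota_gb=20)
org.grant("alice", "payments")

print(org.can_read("alice", "payments"))  # alice was granted access
print(org.can_read("alice", "web"))       # no grant, so no access
print(org.total_quota_gb())
```

The point of the sketch is the shape of the model: permissions and quotas live in a single structure that an admin controls, while each user's view is limited to the sub accounts they belong to.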

Data enrichment and optimization

Observability data can be noisy, overwhelming, and confusing. As cloud workloads grow, they generate huge volumes of data – some of it provides helpful and actionable insights, while the rest is useless.

Observability practitioners can improve the overall quality of their data by filtering out the useless information, while enriching the data they keep. ‘Enriching’ data can take the form of log parsing, which structures logs into fields that are easy to search and visualize; or, data can be enriched with rules or alerts that highlight the most critical information. 
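As a rough illustration of both ideas, the Python sketch below parses a simplified nginx-style access log line into named fields, drops a class of low-value events, and adds a derived field to what remains. The log format, the `/healthz` drop rule, and the `is_error` field are assumptions made for this example; real pipelines would typically express the same logic in their shipper or platform configuration.

```python
import re
from typing import Dict, Iterable, Iterator, Optional

# Simplified nginx-style access log pattern (an assumption for this sketch;
# the real format depends on the log_format configured in nginx).
ACCESS_LOG = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes_sent>\d+|-)'
)


def parse_line(line: str) -> Optional[Dict]:
    """Turn one raw log line into searchable fields, or None if it does not match."""
    match = ACCESS_LOG.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["bytes_sent"] = 0 if fields["bytes_sent"] == "-" else int(fields["bytes_sent"])
    return fields


def is_noise(event: Dict) -> bool:
    """Hypothetical drop rule: successful health checks add volume, not insight."""
    return event["path"] == "/healthz" and event["status"] == 200


def enrich_and_filter(lines: Iterable[str]) -> Iterator[Dict]:
    for line in lines:
        event = parse_line(line)
        if event is None:
            # Keep unparsed lines, flagged, rather than silently losing them.
            yield {"message": line, "parsed": False}
            continue
        if is_noise(event):
            continue  # filtered out before it is stored or indexed
        event["parsed"] = True
        event["is_error"] = event["status"] >= 500  # enrichment for alerting
        yield event


sample = [
    '10.0.0.1 - - [12/Oct/2022:10:00:00 +0000] "GET /healthz HTTP/1.1" 200 2',
    '10.0.0.2 - - [12/Oct/2022:10:00:01 +0000] "POST /checkout HTTP/1.1" 502 173',
    "unstructured line",
]
for event in enrich_and_filter(sample):
    print(event)
```

With this sample, the health check is discarded, the failed checkout request comes out as structured fields with `is_error` set, and the unstructured line is passed through flagged as unparsed.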

Encouraging good observability data hygiene is an easy thing to suggest. But frankly, if it were easy, everyone would do it. There is a reason why it’s a pervasive problem in the monitoring and observability world: it’s unintuitive!

Parsing log data requires complicated parsing languages like grok, which can take hours of debugging to get right. Data filtering is also hard because it has to be configured for every data collection component in production, and there can be hundreds of them. Plus, it’s not always easy to tell what you need and what can be filtered out.

For those less experienced with parsing and filtering data, there are technologies and expertise available to simplify things. 

For example, Logz.io makes it easy to identify and filter out unneeded data. This video shows how you can use pattern recognition to spot useless information, and use Drop Filter to discard data on the fly. 

To enrich data, Logz.io provides automatic parsing for popular log types (like nginx, AWS services, and Kafka) as well as parsing-as-a-service – which is as simple as reaching out to our support team to get your logs parsed through our Support Chat. Plus, Logz.io automatically flags critical exceptions and surfaces threats to help users focus on the most important information.

SaaS vs self-hosted 

Companies like Netflix, Facebook, and Google run their own observability systems to gain insights into their services and ensure reliability. Hundreds of thousands of other organizations run self-hosted tools like Splunk or open source monitoring tools (like OpenSearch, Prometheus, and OpenTelemetry) to monitor and troubleshoot their systems.

Running self-hosted systems requires dedicated engineering resources to keep them running smoothly. The user is on the hook to identify and resolve performance issues that could impact data ingestion and query speed – or worse, to resolve issues that cause the system to drop data or crash altogether.

Many teams are happy to dedicate the engineering resources to a self-hosted monitoring stack. Those who use Splunk get excellent analysis capabilities, while open source users get a free tool set that they can customize to their preferences. 

However, many engineering organizations are feeling the strain of limited talent and resources – and there is an opportunity cost of managing a self-hosted stack. Engineers working on the performance and stability (and other tasks like upgrading or implementing data queuing) of self-hosted tool sets could be working on other critical projects.

If you want to allocate engineering resources to other priorities, you can let somebody else manage the observability stack for you. Companies like Logz.io, Datadog, or Sumo Logic provide excellent SaaS platforms that handle the entire data pipeline and clusters for you. All you need to do is send your data, log into your account, and begin analyzing.

Build out a scalable support system

In addition to managing the observability tool set, Observability Shared Services teams often need to support end users. This can require instructions or direct assistance for building dashboards, instrumenting services, building alerts, and navigating the interface.

Effective support ensures end users can quickly and effectively operate the observability system to identify and troubleshoot problems. However, if your observability system is used by a hundred engineers, support can become a tedious full-time job.

This is why Logz.io has invested so heavily in our Support Team. Rather than leaving you solely responsible for the successful adoption of your observability system, Logz.io lets any user reach out through the Support Chat within the app to speak with a Customer Support Engineer.

Our engineers can assist with any request and our average response time is 40 seconds!

Implementing effective and widely adopted observability

Observability systems need to reliably handle huge volumes of data every day for near real-time analysis. This can be a costly, complex, and burdensome task, but it’s essential to get it right. 

An efficient and performant observability system can be the difference between a minor issue that is quickly resolved and a major production event with widespread customer impact. It can also mean thousands of dollars in savings.

Implementing observability with some of these best practices can help ensure more effective and efficient monitoring and troubleshooting. As described above, Logz.io can help make these best practices a reality.

If you’re interested in trying it out yourself, sign up for our free trial here.
To get in touch with an expert about putting these best practices in place at large scales, a product demo may be more helpful.
