Enterprise Observability: Who Owns It?

By: Dotan Horovits

It’s common sense. When a logstorm hits, you don’t want to be left scrambling to find the one engineer on each team in your organization who actually understands the logging system – then spending even more time mapping each team’s logging format to the formats of every other team, all before you can begin to respond to the incident at hand. It’s a model that simply won’t scale.

A centralized observability team, with the mandate to enforce consistently formatted telemetry data, is key for a modern organization. After developing in-house observability at companies such as Twitter, Pinterest, and Slack, Suman Karumuri knows this firsthand. Suman, Sr. Staff Software Engineer at Slack, outlines his vision of a centralized team that specializes full time in observability and tackles two key missions: maintaining the infrastructure and delivering related business value.

The Case for a Centralized Observability Team

The responsibility of managing observability has grown so significant that a single engineer cannot keep it all in their head or practically address every issue involved. At the scale of an organization like Slack, says Suman, there has always been a central observability team in some form.

These teams are known by different names in different organizations, such as Shared Services and Platform Engineering, to name a few. In other cases, this role falls under a centralized DevOps or SRE (Site Reliability Engineering) group. At even larger scales, the observability team is often subdivided further: within it there is a log management team, a metrics team, a tracing team, and so on.

That observability team is responsible for choosing the technical stack, driving development, and ongoing maintenance. Equally important, this team is in charge of disseminating the knowledge this work generates across the entire engineering organization. According to Suman, this requires teaching your engineers how to use the observability infrastructure and ensuring that they are following best practices.

The Need for Centralized Guidance within Engineering Organizations

Without centralized guidance from the observability team, each engineering team will decide on its own logging format or convention for labeling metrics. With so many “standards”, correlating events across team boundaries becomes even more challenging. In the midst of an incident, you don’t want to be asking your teams where the logs are and how they are formatted. You want your teams to have been following a standard that is not merely best for their purposes, but ideal for the organization as a whole.
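To make the idea of a shared logging standard concrete, here is a minimal sketch in Python of a formatter that a central team might distribute so every service emits the same JSON fields. The field names (timestamp, level, service, team, message) and the helper names are illustrative assumptions, not a format described by Suman or used at Slack.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical organization-wide field set; these names are assumptions for illustration.
REQUIRED_FIELDS = ("timestamp", "level", "service", "team", "message")


class StandardJsonFormatter(logging.Formatter):
    """Render every log record as one JSON object with the same keys."""

    def __init__(self, service, team):
        super().__init__()
        self.service = service
        self.team = team

    def format(self, record):
        entry = {
            # UTC ISO 8601 timestamps keep events comparable across teams.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "team": self.team,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service, team):
    """Return a logger whose only handler uses the shared format."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(StandardJsonFormatter(service, team))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A team would call something like build_logger("checkout", "payments") once at startup and log as usual. Because every service produces the same keys, a responder can filter by service or team during an incident without first learning each team's conventions.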

In the worst-case scenario, Suman says, siloed engineering teams unaware of the observability infrastructure that your team has built can waste important time duplicating efforts – leading to the creation and upkeep of multiple monitoring stacks within the same company. Clearly, standardization of best practices and clear communications across the organization are critical roles that the modern observability team should play.

Further, crucial details such as GDPR compliance in logging should also fall under the purview of the centralized observability team. In today’s organizations, observability teams manage huge data pipelines used for complex analytics. In all of these pipelines, you must enforce GDPR compliance by removing personally identifiable information (PII), ensuring security tokens or sensitive data are not sent in logs, and avoiding painful data leaks. These issues can become even more complicated when sending logging and telemetry data off-site.
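As a rough illustration of scrubbing sensitive values before logs leave a service, the sketch below uses a Python logging filter with two regular expressions. The patterns, the placeholder strings, and the key names (token, api_key, password) are assumptions chosen for the example. Pattern matching alone will miss many forms of PII, so this shows the mechanism and is not a complete compliance control.

```python
import logging
import re

# Illustrative patterns only; a real deployment would need a broader, reviewed set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(r"(?i)\b(token|api_key|password)=\S+")


def redact(text):
    """Replace email addresses and key=value secrets with placeholders."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return SECRET_RE.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Scrub the fully formatted message before any handler sees it."""

    def filter(self, record):
        # Format first so values passed as arguments are scrubbed too.
        record.msg = redact(record.getMessage())
        record.args = None
        return True  # keep the record, now redacted
```

Attaching the filter to the handlers that ship logs off-site, for example with handler.addFilter(RedactingFilter()), applies the same scrubbing to every service that uses the shared logging setup.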

Although not every organization will need to roll their own observability infrastructure as Suman’s team has at Slack, a dedicated observability team can find the right technical stack for your organization’s needs, be it the ELK Stack, OpenSearch, Prometheus, or Jaeger. Finding the right fit involves analysis of the needs, patterns, velocity, and volume of your organization, he noted.

Suman offers the example of how to properly handle log storage: if your log data is highly structured and the queries you run are purely analytical, it makes sense to store it in a data warehouse. On the other hand, says Suman, if your log data is highly unstructured, you need something like Elasticsearch.

Observability as Analytics: The Mental Models

In a field with as many competing services, vendors, and packages as observability, it is helpful to keep one fundamental question in mind: what is the overall mental model we should be building towards? In my view, it is that observability is really a data analytics problem.

It is about gathering data of many types, from many sources and in many formats, meshing it all together into a conceptual data warehouse, and then being able to ask and answer whichever questions I need in order to understand what goes on in my system.

Suman takes a similar view, noting that: “observability actually is a way to understand what’s happening in your system.” However, he places special emphasis on the ingestion stage: “I think a better model for observability is to think about gathering pre-aggregated or raw data and then ingesting that data to suit your queries.”

In Summary

From technical maintenance to the distribution of standards, to maintaining legal compliance of telemetry data, observability has clearly become a specialization.

Similar to database administration teams, observability teams have increasingly become key players in enterprises and even mid-sized organizations. With this modern approach, observability is no longer just a backend-oriented matter of logs, metrics, and traces.

Rather, observability is the process of knowing and understanding what is happening in your systems. As Suman nicely noted, those systems are not limited to the backend; they also include frontend systems, business processes, and user retention.

According to Suman, “Even to some of the analytics use cases that people typically use, [that] people don’t even think of using observability systems for, it can be applied there.”

Want to learn more? Check out the OpenObservability Talks episode: Building web-scale observability at Slack, Pinterest & Twitter.