Going Beyond Infrastructure Observability: Meta’s Approach

What’s the ultimate goal of bringing observability into an organization? Is it just to chase down things when they break? Or can it be used to truly enable developers to innovate faster?

That’s a topic I recently discussed with David Ostrovsky, a software engineer at Meta, the parent company of social media networks Facebook and Instagram among others. He was my guest on the most recent episode of the OpenObservability Talks podcast.

Before joining Meta, David worked as the chief architect for Proofpoint, a longtime provider of security, DLP and archiving for corporate email systems. During that time, the way Proofpoint approached observability changed greatly, as the company went from chasing the “survival level” of observability to a more mature practice. But everything changed when he got to Meta.

“I thought I knew what observability was when I left Proofpoint,” David said. “I was very convinced that I had this thing completely solid. Then I joined Meta and discovered that there's a whole world of things you can do after you solve all the basic observability problems. Now I'm relearning what it actually means to have good observability when it's not just 100 million users, it's 2 billion users and you have essentially infinite resources to throw at the problem.”

How the Observability Conversation Starts at Meta

At Meta, David said the conversations about observability usually start with talking about SLAs. This has not always been his experience at other organizations, however.

“They start from a business discussion,” David said. “Then they break down into the normal chain of SLAs, then you find your service level objectives, then you go and find the right indicators, and you work from the business inwards. Whereas in most other places where I experienced observability, where you actually had to build the physical infrastructure, it always started from an engineering conversation of, ‘How do we monitor CPU usage? How do we make sure machines stay up? How do we detect the service crashing and starting?’ I’m not going to say always, but in most cases it wasn’t business-driven, it was just engineering-driven because you are first of all focused on the engineering cloud button.”
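The chain David describes, from a business promise down to an objective and then to an indicator, can be illustrated with a minimal Python sketch. It is a generic illustration and not Meta's tooling: the `Slo` class, the function names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A service level objective derived from a business promise (hypothetical example)."""
    name: str
    target: float  # e.g. 0.99 means 99% of events must be "good"


def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: fraction of events that met the promise."""
    # With no traffic there is nothing to violate, so report a perfect score.
    return good_events / total_events if total_events else 1.0


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Share of the error budget still unspent; negative once the SLO is breached."""
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Assumed numbers: 1,000 decisions in the window, 995 of them met the promise.
checkout_slo = Slo(name="decision returned within the promised time", target=0.99)
print(sli(995, 1000))
print(error_budget_remaining(checkout_slo, 995, 1000))
```

Starting from the objective rather than from CPU or uptime metrics means the alert condition ("the error budget is nearly spent") is phrased in terms of what customers were promised.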

To put a fine point on it, David’s team at Meta doesn’t actually deal with the infrastructure. A separate group sets up all the infrastructure, including the observability infrastructure. David’s team is then a consumer of that well-established infrastructure.

“My team deals with the decision-making layer in fraud detection, which is a lot of machine learning and heuristics,” he said. “We don’t deal with infrastructure, we use the infrastructure provided by Meta Infra Teams, and they use observability infrastructure provided by different observability teams. We can all access the shared UI tools together. If I want to create an alert, I can just go to the dashboarding tool and click a few buttons, and I have an alert.”

At Meta, other teams make sure the systems are working and address the physical layer of observability, ensuring VMs are up and applications are actually running. This allows David’s team to focus on the business layer of observability: ensuring the product is doing what it’s supposed to do, and that changes in input data or user behavior don’t alter the logical behavior of the system in a way that negatively impacts their product.

“It’s a much harder problem because there’s really only so many ways physically that a software system can be broken,” he said. “You have your four golden signals and that covers 80% of all the different ways. But if you start thinking of how many different ways logically a system can misbehave, you can’t enumerate them. We keep coming up with new ways to break the logic of the system.”
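To make the contrast with golden-signal checks concrete, here is a minimal sketch of one possible business-layer check: watching whether the share of requests a decision system flags drifts away from its recent baseline. This is an assumption made for illustration, not a description of how David's team works; the function name, the threshold, and the sample rates are hypothetical.

```python
from statistics import mean, pstdev


def rate_shift(baseline_rates: list[float], current_rate: float, threshold: float = 3.0) -> bool:
    """Return True when current_rate sits more than `threshold` standard
    deviations away from the mean of the baseline rates."""
    mu = mean(baseline_rates)
    sigma = pstdev(baseline_rates)
    if sigma == 0:
        # A perfectly flat baseline: any change at all counts as a shift.
        return current_rate != mu
    return abs(current_rate - mu) / sigma > threshold


# Hypothetical daily share of requests flagged by the decision layer.
baseline = [0.020, 0.022, 0.019, 0.021, 0.020]
print(rate_shift(baseline, 0.021))  # within normal variation
print(rate_shift(baseline, 0.050))  # far outside the baseline
```

A check like this can fire while every machine is healthy, which is the point David makes: the system is up, but its logical behavior has changed.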

This is something I often discuss with practitioners. Even if you are a small or medium-sized organization and cannot offload the “plumbing” and infrastructure aspects to someone else, you need to make sure you also address the higher-order observability needs of your product. Ultimately, teams should want their observability practice to move the needle on business-facing decisions. This is achieved through intelligent analytics over the telemetry data, which Meta seems to have mastered.

Another important group of customers is the internal customers, namely your product’s developers. If you have an internal platform engineering team, you want it to improve your developers’ experience and accelerate their development cycles. In my experience, developer observability plays a key role for these teams, and David shared that this is also the case at Meta.

“We really do start out with, ‘What is the impact we’re trying to achieve?’” David said. “‘What’s the customer going to experience, whether it’s the external users or internal customers who rely on our service? What do they expect? What do we promise the customers?’ We work backwards from that to what we need to look at as indicators, how we’re going to combine them together and predict or alert on them, and what’s the expected behavior? We don’t actually worry about whether the system is up. Of course, it’s up, it’s been up for a decade and there’s a thousand engineers making sure it stays up, and it’s not us.”

Want to learn more? Check out the OpenObservability Talks episode: Meta’s data driven approach to observability