Monitoring is often not the first thing on the mind of the modern developer. Yet it’s necessary at many points of the software development lifecycle, including before deprecating an API, before launching a new feature, and after launching the feature. In fact, developers’ monitoring needs can vary far more than those of classic Ops monitoring.
There is one type of telemetry data that is highly valuable for developers, yet is rarely discussed in the realm of observability: application snapshots. They’re useful when investigating exceptions and debugging applications in production, to name a couple of uses. In this article I’d like to look into application snapshots, understand their benefits, and see if they can serve as a new observability signal, one that’s geared towards developers.
I recently discussed this topic on the OpenObservability Talks podcast with Liran Haimovitch, CTO of Rookout. On that episode we also discussed what observability means for developers, how to determine what they should be monitoring, how observability fits into current dev tools and processes, and how observability can actually be fun for a community that doesn’t typically put a premium on it.
In case you missed it, here is a summary of the first part of the conversation.
Liran is the co-founder and CTO of Rookout, a live data collection and debugging platform. He’s an observability and instrumentation expert with deep understanding of Java, Python, Node, and C++, as well as having broad experience in cybersecurity and compliance in his past roles. He’s also a fellow podcaster, and you can check out his podcast here.
In an ideal world, observability should be a no-brainer for developers, and be integrated into their process. I asked Liran how we can bring observability closer to the daily work of the developer, or at least lower the barriers to entry.
“I would argue that we’ve optimized the pillar of observability for a service,” he said. “And we’ve first added metrics for monitoring at scale where logs fail us, and added tracing for monitoring in depth where complexity rises. We need to think about: ‘What pillars do engineers need? And what pillars would provide them with the best experience for their needs, that would be the most familiar for them?’ One of the things we found that works is that developers love application snapshots. They love to be able to tell the state of the application at a given line of code.”
This is part of the movement to go beyond the three pillars of observability, which is an important trend in observability. Focusing solely on logs, metrics and tracing signals limits our ability to provide full observability across the full lifecycle and across the organization, leaving developers underserved to a degree.
An application snapshot is a reference marker for an application at a particular point in time. It contains a copy of the application’s data together with the application’s current state, such as the exceptions, stack trace and the value of local variables.
Different signals serve different breadth or depth of system analysis. Metrics give a broad aggregative view of our application’s behavior, traces give request-wide observability, and logs give a deeper and narrower view of a specific service invocation. Snapshots give the next level of deep-dive, providing an even narrower and deeper inspection of a specific code invocation. You can think of snapshots as a modern version of heap dumps.
Snapshots have been around for a while, mostly as part of the exception tracking flow. If an exception is thrown, investigating it becomes far easier when you can look at the state of the application around the time it occurred. Snapshot data is in fact used today behind the scenes by many error tracking and live debugging tools such as Sentry, Bugsnag, Backtrace and Rookout.
“The more we can collect about the state of the application, essentially taking a snapshot of the state of the application at that point in time, the easier it’s going to be for an engineer to come later on and analyze this exception, other than just reading in the log somewhere, ‘an exception was thrown,’” Liran said. “The more accurate and the more full we can make this snapshot, the easier it’s going to be for developers.”
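To make the idea concrete, here is a minimal Python sketch of an exception-time snapshot: it walks the traceback and records, for each frame, the function, the line number and the local variables. It is an illustration only, not how Rookout or any of the tools above implement snapshots; `capture_snapshot`, `apply_discount` and `checkout` are hypothetical names invented for this example.

```python
from datetime import datetime, timezone


def capture_snapshot(exc: BaseException) -> dict:
    """Collect the exception plus the locals of every frame in its traceback."""
    frames = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        frames.append({
            "function": frame.f_code.co_name,
            "file": frame.f_code.co_filename,
            "line": tb.tb_lineno,
            # repr() keeps the snapshot serializable and detached from live objects
            "locals": {name: repr(value) for name, value in frame.f_locals.items()},
        })
        tb = tb.tb_next
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "exception_type": type(exc).__name__,
        "message": str(exc),
        "frames": frames,  # outermost frame first, failing frame last
    }


# Hypothetical application code used only to trigger an exception.
def apply_discount(price, percent):
    ratio = percent / 100
    return price / (1 - ratio)  # divides by zero when percent == 100


def checkout(cart_total):
    discount = 100
    return apply_discount(cart_total, discount)


try:
    checkout(250)
except ZeroDivisionError as exc:
    snapshot = capture_snapshot(exc)
```

The last entry in `snapshot["frames"]` describes `apply_discount`, including the values of `price`, `percent` and `ratio` at the moment of failure, which is far more than a bare "an exception was thrown" log line would tell you.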
Liran said developers find snapshots easier to use for seeing the full structure of an application and how it got to its present state, for seeing what endpoint was involved, and especially for seeing the values of all the app’s variables in context. Snapshots are deeper, more comprehensive, and great for the day-to-day tasks engineers have to deal with and the questions they ask. Additionally, an application snapshot looks and feels much more like a live debug session than a traditional observability tool, which makes it more familiar for a developer.
Given their nature, snapshots are hyper-focused and data-heavy, and therefore require users to be very precise about where and when they want to use them. This raises the need for dynamic observability.
“We need to be able to decide in real time where we want to get a snapshot from,” Liran said. “This brings to light the new concept of dynamic observability and real time instrumentation, essentially allowing you to decide in real time what data you want.”
Dynamic observability is a valuable concept for other telemetry signals as well. Just think of the situation of debugging a piece of code in production, and missing some logging data there. Adding log lines would typically require rolling out a new version of that service to production and reproducing the issue with that version, which can take some time. Wouldn’t it be great if you could dynamically add that log line in real time?
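As a rough illustration of that idea, the Python sketch below attaches and later removes a recording probe on a method of a running program, without editing or redeploying the method itself. It is a deliberate simplification: it wraps a whole function rather than instrumenting a specific line, and it only affects callers that look the method up on the class at call time. `attach_probe` and `PricingService` are hypothetical names invented for this example, not part of any tool mentioned here.

```python
import functools


def attach_probe(owner, name, sink):
    """Swap owner.<name> for a wrapper that records each call.

    Returns a detach() function that restores the original attribute.
    """
    original = getattr(owner, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        result = original(*args, **kwargs)
        # Record the call only after it succeeded; exceptions pass through untouched.
        sink.append({"function": name, "args": args, "kwargs": kwargs, "result": result})
        return result

    setattr(owner, name, wrapper)

    def detach():
        setattr(owner, name, original)

    return detach


# Hypothetical service standing in for code that is already running.
class PricingService:
    def quote(self, amount, tax_rate):
        return round(amount * (1 + tax_rate), 2)


service = PricingService()
records = []

service.quote(100, 0.2)  # not recorded: no probe attached yet
detach = attach_probe(PricingService, "quote", records)
service.quote(100, 0.2)  # recorded while the probe is attached
detach()
service.quote(50, 0.1)   # not recorded: probe removed again
```

After this runs, `records` holds exactly one entry, for the call made while the probe was attached. Production-grade tools typically work at a lower level, for example through bytecode instrumentation, so that they can target an individual line and keep the overhead small.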
In the first part of our discussion, we talked about the differences between developer observability and operations observability, and the different needs of the two groups. One of the things that came up was that dynamic capabilities are needed more on the developer side than on the operations side.
“As long as something is being executed, being able to instrument it and re-instrument it without having to stop the application, without having to change the state, let alone get snapshots from it, those are all super useful tools that we should strive to make the standard for every developer out there,” Liran said.
Developers feel comfortable with logging, but not as much so with other observability data. While we see the growing adoption of distributed tracing, I find that it is still far less intuitive for developers than logs. Inspecting the application state based on snapshot data can present a more intuitive experience and flow for developers, which is better aligned with their debugging practices.
I’d add that good integration with popular IDEs (integrated development environments) can significantly boost this experience. Native support for snapshots in the popular runtime environments is another critical factor in growing adoption, and in reducing the performance penalty associated with snapshots. Java is a good example of such support, with instrumentation APIs having been available in the JVM for a couple of decades now, and with an ecosystem of supporting tools such as ASM. Not all programming languages show that level of native support, however. There is great variance between them, and each language and tool addresses the problem independently.
The lack of open source standardization around snapshots is a big inhibitor for adopting snapshots more widely. I’d like to see the vendors and end users in this space aligning to standardize and create interoperability across programming languages, runtimes and tools, in the same spirit as OpenTelemetry.
Want to learn more? Check out the OpenObservability Talks episode: Observability for Developers Demystified on: