Originally posted September 3, 2019 on Hackernoon by Daniel Berman
All engineering teams strive to build the best product they can as quickly as possible. Some, though, stumble into a false dichotomy of choosing between speed and quality. While that choice may have been necessary in the past, it’s not the case today.
What I’d like to do in this article is explain why.
By reviewing the relationship between frequent software releases and software quality, and how both depend on telemetry, I'll try to show that frequent releases coupled with data-backed insights are the best way to succeed in today's marketplace.
Pushing for faster value streams
Value streams define engineering work. A stream includes all the work required to produce the end outcome. The outcome may be launching a new product, releasing a new feature, or simply churning through support tickets. Irrespective of their specific job title, everyone in engineering participates in a value stream. Organizations have always wanted to have faster value streams, and why wouldn’t they? Working faster means earlier time to market, beating competitors, and putting the product or service into customers’ hands quicker.
Modern technology has turned every company into a technology company—whether they know it or not. In today's market, building and shipping software successfully directly impacts the bottom line. The most successful companies differentiate themselves by building better engineering value streams. Their principles and practices of continuous delivery, guided by real-time telemetry and continuous improvement, have come to be known as DevOps. Moving from waterfall to agile development and DevOps methodologies makes complete sense. But if that's the case, why are so many organizations still debating speed versus quality in their application releases? Aren't they part of the same thing?
The research behind engineering performance
In their two books, The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations and Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations, Gene Kim, John Willis, Patrick Debois, Jez Humble, and Nicole Forsgren chart a clear course for improving engineering performance and debunk the idea that moving fast reduces quality.
Martin Fowler, the legendary software architect and author, claims that refuting this very idea is their most important contribution:
“This huge increase in responsiveness does not come at a cost in stability, since these organizations find their updates cause failures at a fraction of the rate of their less-performing peers, and these failures are usually fixed within the hour. Their evidence refutes the bimodal IT notion that you have to choose between speed and stability—instead, speed depends on stability, so good IT practices give you both.”
The findings in these research pieces are convincing. High-performing teams, compared with low-performing teams, deploy 46 times as often, move 440 times faster from commit to production, recover from incidents 170 times faster (MTTR), and are one-fifth as likely to encounter a failed deploy. These findings go hand in hand with Fowler's assessment that good IT practices create speed and stability.
The practices also generate business results. Accelerate compared data from Puppet Labs' State of DevOps reports over three years. It found that high-performing teams had 50% higher market capitalization growth than lower performers, and that they were twice as likely to exceed profitability, productivity, market share, and customer goals. They're "twice as likely to exceed noncommercial performance goals as low performers: quantity of products/services, operating efficiency, customer satisfaction, quality of products/services, achieving organizational/mission goals."
If this is not proof enough, two case studies in The DevOps Handbook (Gary Gruver's experience as director of engineering for HP's LaserJet firmware division, and Ernest Mueller's at Bazaarvoice) further drive home the point. They demonstrate that an established set of technical practices creates both speed and stability. This is the basis of The DevOps Handbook's first principle, the Principle of Flow, which focuses on improving the time from development to production. However, its benefits are contingent on the second principle.
The Principle of Feedback
The Principle of Feedback allows teams to course correct according to what’s happening in production. This requires telemetry (such as logs and metrics) across the value stream. The idea goes beyond the simple approach of monitoring uptime. Integrating telemetry across the value stream allows development and product management to quickly create improvements whether it’s from a production outage, deployment failure, A/B tests, or customer usage patterns. Focusing on telemetry makes decisions objective and data driven.
Again, there are desirable knock-on effects across the organizations. Data aligns teams with objective goals, amplifies signals across the value streams, and sets the foundation for organizational learning and improvement.
But designing for feedback doesn’t just happen. It’s a by-product of strong leadership that internalizes the goal to create telemetry within applications, environments, both in production and pre-production, and in the deployment pipeline.
Scott Prugh, Chief Architect and Vice President of Development at CSG, said:
“Every time NASA launches a rocket, it has millions of automated sensors reporting the status of every component of this valuable asset. And yet, we often don’t take the same care with software—we found that creating application and infrastructure telemetry to be one of the highest return investments we’ve made. In 2014, we created over one billion telemetry events per day, with over one hundred thousand code locations instrumented.”
High-performing IT teams use telemetry to resolve production incidents 168 times faster than their peers, with MTTR measured in minutes; low performers' MTTR is measured in days. That's no surprise when key business metrics are tracked, visualized, and monitored.
The following image is taken from Ian Malpass' "Measure Anything, Measure Everything." It displays successful and unsuccessful logins, with vertical lines annotating deploys. It's immediately apparent that there was a problem before 7 a.m. that may be related to the last deploy. Teams can discern that information in seconds, which means a quicker diagnosis and, paired with continuous delivery, a faster resolution.
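The kind of correlation that graph makes possible can be sketched in code. This is a hypothetical illustration, not Etsy's implementation: given a per-minute series of successful logins and a list of deploy times, flag any deploy followed by a sharp drop.

```python
def deploys_with_login_drop(logins_per_minute, deploy_minutes, drop_ratio=0.5):
    # Flag deploys where logins in the minute after the deploy fall
    # below drop_ratio * logins in the minute before it.
    suspects = []
    for t in deploy_minutes:
        if 0 < t < len(logins_per_minute):
            before, after = logins_per_minute[t - 1], logins_per_minute[t]
            if before > 0 and after < before * drop_ratio:
                suspects.append(t)
    return suspects

logins = [100, 98, 102, 30, 28, 95]  # sharp drop starting at minute 3
print(deploys_with_login_drop(logins, deploy_minutes=[3, 5]))  # [3]
```

A dashboard does the same comparison visually; the point is that deploy annotations turn "logins dropped" into "logins dropped right after that deploy."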
Etsy embodies this philosophy in what became known as the “Church of Graphs.” Ian Malpass, the Director of Engineering, puts it simply: “Tracking everything is key to moving fast, but the only way to do it is to make tracking anything easy…We enable engineers to track what they need to track, at the drop of a hat, without requiring time-sucking configuration changes or complicated processes.”
Eventually they added enough tracking to make deployments safe.
Building proper telemetry infrastructure is a key decision. The telemetry system must take in numeric data and logs from multiple components and present them in a unified way. Many teams start with a simple StatsD setup, since adding telemetry requires one line of code plus a StatsD server to aggregate the data. That's enough to start, but it omits a huge data source: all applications produce logs, and logs may contain the most valuable insights into performance and are vital for effective troubleshooting.
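To make that "one line of code" concrete, here's a minimal sketch of the StatsD wire protocol, which is just a plain-text UDP datagram of the form `metric:value|type`. The metric names and server address are illustrative assumptions, not from the article.

```python
import socket

def statsd_line(metric, value, metric_type="c"):
    # StatsD wire format: "<metric>:<value>|<type>"
    # "c" = counter, "ms" = timing, "g" = gauge
    return f"{metric}:{value}|{metric_type}"

def send_metric(metric, value, metric_type="c", host="127.0.0.1", port=8125):
    # UDP is fire-and-forget: the application never blocks,
    # even if no StatsD server is listening on the other end.
    payload = statsd_line(metric, value, metric_type).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

# The "one line" dropped next to application code, e.g. per login attempt:
send_metric("login.success", 1)            # counter increment
send_metric("page.render_time", 42, "ms")  # timing in milliseconds
```

In practice teams use a client library (e.g. the `statsd` package) rather than raw sockets, but the protocol really is this small, which is why instrumenting a new code path is so cheap.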
If adding instrumentation to code isn't possible, telemetry can be extracted from existing logs. Some failure scenarios also can't be instrumented in code. A log entry like "process 5 crashed from a segmentation fault" can be counted and summarized as a single segfault metric across all infrastructure. Transforming logs into data enables statistical anomaly detection, so a change from "30 segfaults last week" to "1,000 segfaults in the last hour" can trigger an alert.
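A minimal sketch of that log-to-metric transformation follows; the log format and the fixed alert threshold are assumptions for illustration, not from the article.

```python
import re

SEGFAULT_RE = re.compile(r"segmentation fault", re.IGNORECASE)

def count_segfaults(log_lines):
    # Collapse free-form log entries into one numeric metric.
    return sum(1 for line in log_lines if SEGFAULT_RE.search(line))

def is_anomalous(current, baseline, factor=10):
    # Naive threshold: alert when the current window exceeds the baseline
    # by a large factor. Real systems would use rolling statistics
    # (e.g. a z-score over historical windows) instead of a fixed factor.
    return current > baseline * factor

logs = [
    "06:58:01 process 5 crashed from a segmentation fault",
    "06:58:02 request served in 12ms",
    "06:59:05 process 9 crashed from a segmentation fault",
]
segfault_count = count_segfaults(logs)       # counts 2 matching entries
alert = is_anomalous(current=1000, baseline=30)  # True: trigger an alert
```

Once the counts exist as a metric, they can be shipped to the same dashboards and alerting rules as instrumented code.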
The debate is still alive
As Dale Vile, CEO and Distinguished Analyst at Freeform Dynamics Ltd, explained in an article for Computer Associates, organizations are still debating a supposed trade-off between speed and quality, despite the wide-scale agreement above that the two should go hand in hand.
The simple fact is that environments have grown even more complex, with constant cycles of development, testing, releasing, and ongoing support. The growing adoption of microservices, plus an orchestration layer to manage the complexity, has made speed inevitable, while quality is pushed by a plethora of vendors, each focused on one part of the application release value chain. Every link in the chain is critical, but gaining both high- and low-level visibility across all the constantly moving parts of the application delivery cycle has become more challenging.
Logging, metric monitoring, and, increasingly, distributed tracing are common practice among engineering teams. By definition, though, the best we can do with them is react as quickly as possible once notified that a log- or metric-based event has occurred. This falls short of being proactive enough to ensure safe and continuous operations. To help engineers be more proactive, the next generation of supporting technologies needs to automatically identify and reveal events already impacting the environment, together with the context required to see the bigger picture.
Roi Ravhon, Core Team Lead here at Logz.io, explains:
“In today’s world of orchestration of hundreds of microservices, it is impossible to reach continuous operations without the ability to proactively see errors and exceptions before they impact production. Indeed, with these insights, the faster the application release that can be achieved, the better the quality of the final product will be. Without proactive insights, the opposite is true.”
With this technological capacity, we can truly achieve the vision of continuous operations with the modern technology stack, so long as it is enhanced with a layer of application insights that intelligently gives us visibility into errors and exceptions, both those that have been previously defined and those that haven't. In this state, speed = quality.
The debate should be OVER
I think it's time to finally bury the debate over speed and stability. The studies above demonstrate that applying the Principle of Flow via continuous delivery, backed by the Principle of Feedback via telemetry, produces both speed and quality.
Cloud computing, containers, container orchestration, and serverless have increased access to continuous delivery. Monitoring technologies have evolved as well, with an eye toward automation and advanced analytics based on machine learning. Today's engineering teams are better poised than ever to build robust telemetry systems: there's a plethora of paid products from all classes of vendors, open-source platforms, and a host of integrations for all popular languages and frameworks. These platforms take advantage of the current landscape, offering drop-in telemetry and visualization for servers, containers, orchestrators, virtual machines, and databases.
Today’s systems create huge volumes of data, and data analysis tools must be able to keep pace. Automated metrics and trend prediction can amplify failure signals and reveal new ones that teams couldn’t find on their own. Advanced log analytics has become standard practice, helping teams to effectively dig deeper to discover the root cause in a fraction of the time this task used to take. In 2019, artificial intelligence is already adding even deeper visibility and providing the insights necessary to trust the speed and improve products safely.