Do Your Execs Know What It Takes to Manage ELK?

By: Evan Klein

Originally published at The New Stack.

We’ve all experienced it: executives with unrealistic expectations who vastly underestimate the amount of time our work can take. Most of us assume that to be the exception and not the norm. But when it comes to monitoring and troubleshooting, it seems to be all too commonplace. A recent Logz.io survey of software engineers and the executives they report to suggests that engineers who troubleshoot and monitor their applications using open source tools like the ELK stack are spending far more time maintaining these monitoring systems and troubleshooting production issues than their executives imagine.

The result? Misalignment, overspending, understaffing, and even burned out engineers.

Open Source FTW

Open source has become a staple of the machine data analytics world, overtaking proprietary solutions by a wide margin to become the standard. It was recently reported that the ELK stack had been downloaded over 100 million times! And it’s obvious why: the ELK stack is easy to deploy, free to download and use, has a growing community, and provides some powerful analytics out of the box.

More Time Than We Gambled For

But ELK is not without its challenges. While it’s quite easy to set up, it can be difficult to scale, and many engineers find it ends up requiring a significant amount of time to maintain. Performance issues, outages, upgrades: all of these require time and effort from engineers. It’s absolutely crucial work and certainly recognized as such, but it seems the sheer workload is being discounted in the minds of IT and engineering executives. Here’s what we found:

Surprising Results

DevOps engineers are spending on average over 10 hours per month fixing ELK-related performance issues. Their executives told us their teams were spending less than 5 hours. That’s a 52% underestimation of the time their teams are spending! Want some more cringe-worthy stats? Engineers reported ELK crashing over 1.7 times per month, while their execs put it at only about once per month. The only thing IT leaders and their teams agreed on? They upgrade ELK just over 2.5 times per year on average.
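For anyone who wants to run the same comparison on their own team, here is a minimal sketch of how an underestimation percentage like the one above can be worked out. The function name and the input values are assumptions made for illustration. The exact survey averages are not given in this article, so the numbers below are simply picked to be consistent with "over 10 hours" and "less than 5 hours" and are not Logz.io's data.

```python
def underestimation_pct(actual_hours: float, perceived_hours: float) -> float:
    """Return the share of the actual time that the estimate misses, in percent."""
    if actual_hours <= 0:
        raise ValueError("actual_hours must be positive")
    return (actual_hours - perceived_hours) / actual_hours * 100


# Hypothetical inputs, NOT the survey's raw figures: chosen only to match
# the rounded numbers quoted in the article.
engineer_reported = 10.0  # hours per month reported by engineers (assumed)
exec_estimated = 4.8      # hours per month estimated by executives (assumed)

gap = underestimation_pct(engineer_reported, exec_estimated)
print(f"Executives underestimate by about {gap:.0f}%")
```

Swap in the hours your own engineers log and the hours your leadership guesses, and you have a concrete number to bring to the conversation.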

Put it all together: engineers spend almost 7 hours every week managing ELK upgrades, incidents and crashes alone, which is 40% more time than their executives realize! Add to that the 10% more time it takes to troubleshoot production issues (we’re trying to find a needle in a haystack here, people, have some patience!). No wonder they expect us to get more done than we’d say is realistic.

Let’s Get Honest

It’s time to set the record straight. Maintaining ELK takes time, and IT executives need to know that. It’s not that you can’t do the work; you can. You just need the right tools: hosted ELK or, better yet, fully managed ELK-based monitoring and troubleshooting tools can really cut down the amount of time spent on maintenance work. But we also need better communication and more transparency in our organizations. Don’t leave it to your managers to manage up and temper expectations. That’s clearly not happening. Engineers, push to (or push your managers to) open up lines of communication with IT leaders. Managers, keep tabs on the amount of time your teams are actually spending on maintenance vs. delivering new value. And IT leaders, keep a finger on the pulse of your team, ask your engineers questions to uncover misalignments, and create transparency in your organization.

Take a look at the results of Logz.io’s Managing ELK Stack survey here.