Check out Daniel’s upcoming talk on our September 3rd webinar, “Nailing ELK at Scale: A Real World Success Story from Holler.”
Holler is a messaging tech company that enriches conversations everywhere by creating and delivering useful, entertaining, expressive visual content to add texture and emotion to messaging environments. As the company has continued to grow, the engineering organization has scaled to meet the demand for its services.
However, without a fully staffed Operations team, most of the engineers at Holler perform double duty across DevOps to keep the service performant for consumers. Fortunately, Holler has evolved its technology stack and the solutions that it uses for observability, performance visibility, and monitoring.
And, while the engineering organization at Holler has always had a preference for open source-based solutions, the team did meet challenges in scaling and managing the ELK stack to keep pace with their tremendous growth. In fact, the raw scale of log data generated by Holler messages made it difficult to analyze and troubleshoot production issues in a timely and cost-efficient manner.
As Daniel discusses in the Q&A below, the observability journey has evolved and incorporated a few different processes and solutions for metrics monitoring, log management, and analytics. Most recently, the company deployed Logz.io as an alternative to ELK, to reduce some of the maintenance burden on the engineering team. Holler has also worked closely with Logz.io to utilize tools like Drop Filters to reduce their log volume by sampling and filtering the data before indexing it.
In fact, Holler’s log volume was recently reduced from more than 1 TB per day to an average of 200 GB per day, and Holler now has a more scalable and reliable logging pipeline that saves significant processing costs. The team is also better equipped to analyze and monitor the performance of its services via Logz.io’s unified logs and metrics observability platform.
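To make the idea of filtering and sampling logs before indexing more concrete, here is a minimal Python sketch. It is not Logz.io’s Drop Filters implementation or API. The function name `drop_and_sample`, the rule format, the `level` field, and the choice to always keep warnings and errors are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, Iterable, Iterator, List, Optional

LogEvent = Dict[str, str]
DropRule = Callable[[LogEvent], bool]

# Hypothetical policy: levels that are never sampled away.
ALWAYS_KEEP_LEVELS = {"WARN", "ERROR"}


def drop_and_sample(
    events: Iterable[LogEvent],
    drop_rules: List[DropRule],
    sample_rate: float,
    rng: Optional[random.Random] = None,
) -> Iterator[LogEvent]:
    """Yield only the events that should be shipped for indexing."""
    rng = rng or random.Random()
    for event in events:
        # 1. Filter: discard anything that matches a drop rule.
        if any(rule(event) for rule in drop_rules):
            continue
        # 2. Always keep the levels needed for troubleshooting.
        if event.get("level") in ALWAYS_KEEP_LEVELS:
            yield event
        # 3. Sample: keep roughly `sample_rate` of everything else.
        elif rng.random() < sample_rate:
            yield event


if __name__ == "__main__":
    incoming = [
        {"level": "DEBUG", "message": "cache miss"},
        {"level": "INFO", "message": "sticker served"},
        {"level": "ERROR", "message": "upstream timeout"},
    ]
    # Hypothetical rule: never index DEBUG logs.
    rules = [lambda e: e.get("level") == "DEBUG"]
    for kept in drop_and_sample(incoming, rules, sample_rate=0.2):
        print(kept)
```

Because the filtering happens before anything is indexed, the volume that is stored and paid for shrinks accordingly. Going from more than 1 TB per day to about 200 GB per day is a reduction of roughly 80 percent.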
Here is more from Daniel about Holler’s observability journey!
Can you share a bit about yourself and your professional background?
I’m a Lead Engineer at Holler and I primarily work on web services, cloud infrastructure, and most recently, DevOps. And since I’ve joined Holler, it’s been my goal to increase visibility into our systems and really reduce the amount of time it takes us to resolve problems. We’ve been growing very quickly over the last year, so there have been a lot of interesting challenges along the way, transitioning from a small startup to something that’s more mature.
Could you tell us a little bit about what Holler does for those that aren’t familiar with the company?
Holler is a messaging technology company and today, we primarily create and deliver animated stickers to a network of partner applications. Venmo is a notable partner, and you probably have seen our sticker content there.
And as a company, our goal is to build technology that enriches conversations. As more and more conversations happen in chat and digital environments, there are certain human aspects that tend to get lost in translation – such as body language and tone of voice. It’s simply more difficult to convey things like empathy and meaning through chat applications, and our content aims to bring some of that back into digital conversations. Our mission is to do this everywhere online and to be ubiquitous, so our model is to provide SDKs on both iOS and Android to our partner apps.
It’s also worth noting that the core functionality of our product is to provide relevant contextual recommendations of content. Our proprietary AI technology allows us to serve relevant content at just the right time, so that it’s useful to the end user. At the same time, while we are a tech company, more than half of our team is actually an animation studio. We produce all of our animated content in-house and we’re also looking beyond stickers too, into other types of content we can offer to enrich conversations.
Okay, from an engineering perspective, how does the tech stack look in terms of where you’re hosting and how you’re running your service?
We are on AWS and Amazon’s been really helpful in terms of all the different things they provide like auto scaling groups, managed container hosts, and managed Kubernetes. Today, most of our services are built in a more traditional fashion and our data pipeline is a pretty traditional Apache-based stack with Kafka, Storm, and Spark. And all of it is pretty much run by us, and we are in a big transition now to move towards containers and microservices. We’re looking very hard at ECS and EKS as part of that maturation process, as we try to use our engineering resources more efficiently towards product goals.
So speaking of that change that you’re trying to make, how are your teams structured today?
Our technology organization is split into two primary buckets. We have a research and data science team that operates out of California; they work on all the natural language understanding and technology that drives our recommendations engine. The engineering team, which I’m part of, is based here in New York City.
We’re a pretty small team, still small enough that we don’t have to break up into subdivisions on a sprint-by-sprint or quarter-by-quarter basis. We’re also fans of using “special ops teams” that come together as necessary to focus on projects for short time periods.
As part of your growth, obviously you’ve gone through an observability journey. Can you talk a little bit about your evolution and where you see things going around the ability to better monitor how your software is executed?
In a sense we have a cautionary tale, because for the longest time we didn’t really have much in the way of monitoring or observability. And that hurt us a couple of times as we grew. It seemed like one day we were a small company that didn’t really need these types of solutions to support visibility and availability. But before we knew it, we encountered issues that could have significantly cost us because the mean time to resolution was so high. And unfortunately, some of these problems included storing more data than we needed to, or the need to run a bunch of extra computations to handle strange behavior from the fleet of devices running our SDKs.
With our SDK, one small bug in our kit can generate a lot of extra data. And for the longest time we didn’t have a lot of visibility into our data management. But we eventually realized we really had to speed up our resolution time, both for our customers and for our Amazon bill.
So what did you start with to help reduce costs and get a better hold on your monitoring and observability? Was it metrics and logs? How was the progression and what you were thinking around it?
Initially, our metrics came from CloudWatch’s free offering, which proved to be very limited in certain scenarios. I remember system memory not being part of the baseline metrics, even though there was an issue with our memory usage.
We also had our own ELK stack for the longest time for log aggregation, and that worked great. But as the traffic kicked up, and without an in-house operations team, we had trouble managing and scaling. Engineers were working double duty as operations and product developers without a true DevOps team.
Limits of Scaling DIY ELK
As that traffic grew, the ELK stack became pretty much unusable due to performance. And no one could spend the time to do the necessary things to Elasticsearch in order to have it scale appropriately. But we really liked ELK and wanted to stick with it.
Ultimately, finding a managed ELK provider, like Logz.io, was actually what helped us move past that hurdle. Because with all the knowledge we already had around ELK and Kibana, it felt safe. We liked the technology and having the management taken off our shoulders was critical.
Logz.io has been a cost efficient partner and has helped reduce our processing costs as well as the volume of the data we analyze to ensure we only pay for the logs that we need for troubleshooting purposes.
In addition to logs, we realized we needed more metrics. And because CloudWatch wasn’t giving us everything we needed, we started covering everything in the infrastructure with Metricbeat or Telegraf and started shipping that over to Logz.io to help with more unified observability.
You also mentioned the move towards microservices and where you’re going there. How do you think that’s going to change your observability goal? Are you going to introduce new types of data, any tracing or just new ways of doing data collection as part of that change?
Well again, looking back and seeing how painful not having coverage on services was, we’ve taken that as a lesson. We realized anything we build in our new microservices-oriented fashion has to have some certain things baked in. Now, it’s not an option to exclude them from projects. We also realized that microservices introduced complexity. Any distributed system, as it gets more and more distributed, has more network traffic that gets harder to manage. And if we don’t get the metrics right, we’ll be worse off than when we had no coverage from legacy systems.
So to that end, we’re taking our time with getting microservices right. At the start of this initiative, we want to focus on the microservices that will drive the product. We’re taking our time to initially build a framework that any engineer can use. We’re making it as hard as possible to get the observability piece wrong. There’s distributed tracing built in. There’s boilerplate code. There’s structured logging, so all the logging comes in a unified way with the same logging levels and different fields we consider key. The containers will have the metrics built in. So, anytime a new container’s just spun up on our Kubernetes cluster, then it’ll all be handled.
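As a rough illustration of what unified, structured logging can look like, here is a minimal Python sketch built on the standard `logging` module. It is not Holler’s framework. The field names (`service`, `trace_id`, and so on), the `JsonFormatter` class, and the `get_logger` helper are assumptions chosen for the example.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional, TextIO


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with the same key fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical field: lets a log line be joined to a trace.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def get_logger(service: str, stream: Optional[TextIO] = None) -> logging.Logger:
    """Return a logger that every service configures the same way."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    if not logger.handlers:
        handler = logging.StreamHandler(stream)  # defaults to stderr
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = get_logger("recommendations")
    log.info("served sticker suggestions", extra={"trace_id": "abc123"})
```

Putting the formatter in shared boilerplate means an individual engineer never decides on field names or levels, which is one way to make the observability piece hard to get wrong.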
It sounds like the team is pretty focused and has been for quite some time on open source. Has that been a journey or were you always just completely open source-based from the beginning?
I’d say we were almost completely open source from the beginning, with a few instances where we looked at more proprietary technologies. Speaking of observability, tools like New Relic, Datadog, and Rollbar are great. But we found that the pricing model, which is usually based on hosts, scaled poorly for our team. Even though our host count grows, that doesn’t always mean revenue or the number of engineers is growing.
So, this becomes a cost that goes up, and up, and up. So, the open source technology scaled a lot more elegantly in that way. But, the flexibility piece is also huge. It’s a lot easier to hire DevOps people when you’re using technologies that most people are familiar with.
The ability to migrate is also key. With distributed tracing, we intend to use Jaeger and initially host our own Jaeger infrastructure with the idea that, just like the ELK Stack, it’s probably not something we’ll want to do for the long haul. We’ll need to find a managed provider, so that we can stay on Jaeger, instead of investing time and manpower and then moving onto some other proprietary tool. That would be churn that we don’t see any reason to take on.
Essentially, I think we are leaning towards primarily using open source technology, but not running or hosting it. We’d rather use Amazon’s managed Kafka than run our own Kafka. If it works, we’d rather use the underlying technology than something completely proprietary. In general, at this stage of the company we’re in a buy versus build scenario. We’re trying to be laser focused on the product, and on scale, rather than managing software ourselves. There is a breaking point where the costs can spiral out of control.
Any closing remarks that you had or anything that you wanted to mention?
Definitely. Holler is hiring at the moment. The biggest thing we’d love to have is a keystone hire for our SRE/DevOps role. Like I said, we don’t currently have anyone focused on operations full time. We just have engineers pulling double duty, which is great because we’re all learning a ton. But now that I’ve spent the last few months wearing multiple hats, I’ve realized it’s such a cool space, and it really deserves someone focused on it to help take the company to the next level. Support and knowledge of Kubernetes clusters, infrastructure monitoring, and Amazon Web Services are what we’re looking for. We’re looking for someone who lives and breathes this stuff and wants to level up our game.