Nailing Observability at Scale with Logs, Metrics and Traces

Industry: Digital Media

Company Size: 100

Founded: 2011

HQ: New York, New York

Logz.io Products: Log Management, Infrastructure Monitoring, Distributed Tracing

Company Profile: Animated Mobile Messaging

Cloud Infrastructure: AWS

Holler is a messaging tech company that enriches conversations everywhere by creating and delivering useful, entertaining, expressive visual content to add texture and emotion to messaging environments. The company distributes iOS and Android SDKs to partner applications used by millions of people around the world.

Holler’s business is built on their content recommendation engine, as well as their in-house animation studio that produces stickers and GIFs for services like Venmo and iMessage.

Overview: 

Holler’s engineering team drives an observability strategy that is very similar to their operational strategy, with a focus on staying “as lean as possible to focus on product and machine learning pipelines to power all the cool stuff we want to do in our SDKs and APIs,” said Daniel Seravalli, Lead Engineer at Holler. This strategy includes a heavy emphasis on open source software like ELK for log analytics and Prometheus for metrics, as well as Kubernetes and other cloud-native solutions.

As the popularity of Holler’s services continued to grow, their infrastructure and architecture generated increasingly large volumes of raw log and metric data, which became difficult to analyze when troubleshooting production issues with their open source ELK and Prometheus deployments. Moreover, managing and administering both ELK and Prometheus became burdensome for their lean operations team.

To support Holler’s structured logging and metrics analytics initiatives, the company made the decision to transition to a managed and fully integrated offering from Logz.io. Early in 2020, Holler deployed Logz.io’s open source-based Log Management and Infrastructure Monitoring solutions to centralize and scale their event ingest pipeline.

“The big thing with Logz.io for us is unified logs and metrics. Prior to Logz.io, we really didn’t do a lot in the way of metrics outside of what CloudWatch gave to us. So with Logz.io, it was a big selling point to say, ‘Well, not only are you getting a managed ELK, but you’re getting a managed Prometheus, and we can help you manage and scale all of these together.’ And, over the past year or so of working with Logz.io, they’ve proven themselves time and time again to really know these technologies as true experts.”

The Initial Challenge:

Although Daniel and the engineering team preferred open source solutions, the raw scale of their messaging and user-based log data made it difficult to analyze with their DIY ELK Stack. Moreover, as traffic scaled and volume expanded, the small in-house operations team did not have the time required to run a bulletproof ELK Stack.

“Honestly, in the early days, ELK was just a dumping ground and not a strategic piece of our infrastructure. It was our default position that when we had a service generating some logs, we would just throw Filebeat on it to ship the data, but there wasn’t a strategy around structuring these logs or parsing them, or really thinking through when and where and what to log from our infrastructure.”
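As an illustration of the kind of structure those early logs lacked, here is a minimal sketch of structured (JSON) logging using only Python's standard library. The `JsonFormatter` class, the `build_logger` helper, and the field names (`service`, `level`, `logger`, `message`) are hypothetical choices for this example and are not taken from Holler's codebase.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def __init__(self, service):
        super().__init__()
        self.service = service  # hypothetical service name, set by the caller

    def format(self, record):
        payload = {
            "service": self.service,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name, service):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Because every line has the same named fields, a shipper can forward the output without custom parsing rules, and queries can filter on `service` or `level` directly.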

In addition, poor ELK performance was hurting Holler’s ability to identify and resolve issues quickly, which degraded user experience and ultimately drove excess costs.

“Kibana queries would take forever and this led to a really long time-to-resolution of issues, which in our case, usually costs us a lot of money.”

On the metrics side, Prometheus was providing Holler with a lot of value and insight into the company’s Kubernetes clusters and application performance, but management and maintenance were difficult and time-consuming. Plus, using ELK and Prometheus meant running two separate systems to collect and monitor telemetry data, which increased overhead and prolonged incident investigations.

Naturally, Holler sought a solution that unified these different open source tools on one managed platform.

Unified Observability Platform Solution:

Ultimately, Holler transitioned to Logz.io as an alternative to running ELK and Prometheus themselves, which has helped to reduce some of the maintenance burden on the engineering team.

“We started looking at different providers and Logz.io stuck out immediately because it was ELK. We wanted to stay with ELK because we’re big believers in sticking with open source when we can. And Logz.io’s model of running open source software for cloud-native environments like ours is one we really like,” said Daniel.

Since deployment, Holler has worked closely with Logz.io to reduce their log volume. Using Drop Filters, Holler samples and filters data before indexing. With the support of Logz.io’s team, Holler’s log volume was reduced by over 80%, from more than 1 TB per day to an average of 200 GB per day, saving the company significantly on processing costs.
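The general idea of filtering and sampling logs before they are indexed can be sketched in a few lines of Python. This is an illustrative sketch only, not Logz.io's Drop Filters implementation or API: the rule format (exact field/value matches), the `keep_every` sampling parameter, and all function names are assumptions made for the example.

```python
def should_drop(entry, drop_rules):
    """True if every field in any one rule matches the log entry exactly."""
    return any(
        all(entry.get(field) == value for field, value in rule.items())
        for rule in drop_rules
    )


def filter_and_sample(entries, drop_rules, keep_every=1):
    """Drop entries matching a rule, then keep every Nth surviving entry."""
    survivors = [e for e in entries if not should_drop(e, drop_rules)]
    return survivors[::keep_every]


def reduction_percent(before_gb, after_gb):
    """Percentage reduction in daily log volume."""
    return 100 * (before_gb - after_gb) / before_gb
```

Taking 1 TB as 1,000 GB, `reduction_percent(1000, 200)` gives 80.0, so any starting volume above 1 TB per day corresponds to a reduction of more than 80%.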

“I’ll say the flexible commercial partnership with Logz.io has really helped us. Because it is so difficult to predict our traffic over time, this also means it’s hard for us to predict how much log data we’re going to be generating. And Logz.io works with us and lets us back into a contract over time and ramp up how much we’re paying for the ingest as we need it. So that’s been very helpful.”

Holler also piloted and deployed Logz.io’s Jaeger-based Distributed Tracing solution to help pinpoint production issues. “We’re doing more and more of our new cloud APIs through a microservices architecture on Kubernetes. So distributed tracing gives us more insight into the performance of our application through log-trace correlation. Again, Logz.io manages Jaeger for us, we just run the agents, collect the traces from our different microservices, ship that over to Logz.io, and then we end up saving more and more hours on operational stuff.”

To help take their observability program to the next level, Holler also saw value in correlating between all telemetry data types across logs, metrics and traces. A completely unified approach would further speed incident detection and resolution times.

On the metrics and Infrastructure Monitoring side, Logz.io was able to meet Holler’s needs through the beta program and eventual public launch of Prometheus-as-a-Service. The service is designed to simplify management of the Prometheus backend for customers: users can ship their Prometheus metrics to Logz.io by adding just a few lines to their configuration files and get started in seconds.

“We had been a part of the Jaeger-based Distributed Tracing beta before and we had a good experience. So we felt comfortable with just saying, ‘Yeah, it’s a beta, but let’s just dive in.’ We can kind of have trust in the Logz.io team, that it’s going to work out. By the time it gets to GA, we know it’ll be a product that we can use at scale.”

In terms of deployment, Daniel stated that it was “a smooth experience,” with quick and effective support from the Logz.io team. Next steps include migrating dashboards and testing the service in production and at scale.

“For me, it comes down to: Have you looked into trying to run a Prometheus time series database at scale? Because, it sucks. I mean, if your company wants to do that, all the power to you, but that is the last thing a fairly early stage startup wants to be spending their time on.”

Business Impact:

Now, with Logz.io’s help, Holler has access to all of their observability data in one place to monitor the health and performance of their entire stack without the overhead of needing to manage multiple systems.

In addition to notable productivity gains, faster performance, and better log aggregation than their previous ELK deployment, the ability to analyze different data sources in a single place has enabled Holler to reduce the MTTR of data pipeline issues from days to minutes (an improvement of over 200% in response time).

Moving Forward with Prometheus-as-a-Service

Daniel and his team are also looking forward to leveraging Prometheus-as-a-Service’s built-in alerting engine and 18 months of data retention.

“The alerting is going to be really big for us. Making sure the right people are getting the right data to make actionable decisions. Because, there’s kind of two broad categories of monitoring. It’s the infrastructure, Amazon down to Kubernetes, and then the application side where stuff could be going wrong. In both cases, we are using these tools to reduce MTTR and drive uptime.”

“I’m also looking forward to seeing more self-service from my team rather than me having to pull metrics for them and tell them what’s going on. So building flexible and insightful dashboards will be key here.”

Ultimately, for Daniel and Holler, the focus is being able to rely on Logz.io to offload management of Prometheus, while providing the expertise and support needed to ensure they are getting maximum value from the technology.

And, across Log Management, Infrastructure Monitoring and Distributed Tracing, Holler continues to praise the Logz.io team and their expertise at helping simplify observability and extract critical value.

“With Logz.io, you know they can run the tools flawlessly at scale and teach you how to use it, train your engineers, give you the query you need. If you can’t figure it out, they will say ‘We can try and do this.’ Wait 30 minutes. Here’s the query. Here is the dashboard. That’s the kind of partnership and experience we have with their team.”

 
