ZipRecruiter is a premier online employment marketplace that leverages artificial intelligence to actively connect job seekers with employers.
Challenges with ELK at scale
ZipRecruiter’s services are built on top of a hybrid architecture composed of monolithic applications and microservices. These applications generate a large volume of log data, and the ELK Stack was the weapon of choice to aggregate, store and analyze this data for monitoring and troubleshooting.
As ZipRecruiter’s services grew, so did the volume of log data being generated. The Elasticsearch cluster was ingesting 2 TB of data per day with 7-day retention when major production issues began to surface more often. The SREs at ZipRecruiter found themselves constantly battling Elasticsearch performance issues such as high CPU and memory consumption. The team even had to tackle occasional downtime of the Elasticsearch clusters.
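To put the figures above in perspective, here is a rough back-of-envelope sketch of how much log data such a cluster holds at steady state. The ingest volume and retention window come from the text. The replica count is purely an assumption for illustration, since the case study does not describe the cluster's replication settings, and index overhead is ignored.

```python
# Back-of-envelope estimate of the data a cluster holds at steady state.
TB_PER_DAY = 2        # ingest volume quoted in the text
RETENTION_DAYS = 7    # retention window quoted in the text


def retained_tb(tb_per_day, retention_days, replicas=0):
    """Return TB held at steady state; each replica adds one full extra copy."""
    return tb_per_day * retention_days * (1 + replicas)


primary_only = retained_tb(TB_PER_DAY, RETENTION_DAYS)
# Assumption: one replica per shard (not stated in the case study).
with_one_replica = retained_tb(TB_PER_DAY, RETENTION_DAYS, replicas=1)

print(f"Primary data only: {primary_only} TB")
print(f"With one replica (assumed): {with_one_replica} TB")
```

Even without replicas, roughly 14 TB of raw log data has to stay searchable at any given time, which helps explain why CPU and memory pressure became a recurring problem.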
As the team relied heavily on log data for identifying and troubleshooting production issues, ELK downtime was simply not an option. Given that, and the amount of time and resources being spent on maintaining the ELK Stack, ZipRecruiter quickly understood that a different solution was required.
Alon Becker, an SRE in ZipRecruiter’s Core team, explains: “We were experiencing around 3-4 Elasticsearch-related incidents a month. Since we rely on our logging system for debugging our production environment, not being able to diagnose issues because Elasticsearch was down was highly disruptive and had a major business impact.”
Seeking a reliable and scalable alternative
The ZipRecruiter Core team started the search for an alternative logging solution on the premise that whatever was selected had to meet two key requirements.
First, the selected solution had to fit in easily with the existing ELK-based logging infrastructure. The team did not want to spend time deploying a new system and training people to use it. Second, the solution had to perform at scale. ZipRecruiter required a solution that could handle terabytes of data a day without suffering from performance issues.
Logz.io, which offers a scalable machine data analytics platform built on the ELK Stack, was considered one of the leading alternatives. Alon recalls: “After a few meetings with Logz.io’s executives and architects, it became clear that the ELK-based architecture built by Logz.io was designed to handle the volumes of data we needed it to. The fact that we didn’t need to change anything in our existing logging infrastructure was a huge bonus.”
Making the transition to Logz.io
Migrating to Logz.io from ZipRecruiter’s existing logging architecture was effortless since Logz.io is built on top of the ELK Stack. No additional work other than changing the output destination for Logstash was required. There was also no need for additional training as the team was already accustomed to using Kibana for analysis.
Today, ZipRecruiter ships approximately 3 TB of log data a day into Logz.io. Approximately 100 services, written in Java, Python, Scala, Perl, C++ and Go, output application logs to files. Together with web server logs, this data is then forwarded to Logz.io: Filebeat tracks the log files and forwards the data, via Kafka, to ZipRecruiter’s Logstash instances, which then route the data to Logz.io’s listeners.
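As a rough illustration of what this volume means for the pipeline, the sketch below converts the daily figure into an average sustained rate and an average per-service share. It assumes decimal units (1 TB = 1,000,000 MB) and an even spread across the day and across services, neither of which is stated in the case study. Real traffic is likely bursty and unevenly distributed.

```python
# Back-of-envelope averages for the shipping pipeline described above.
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_mb_per_s(tb_per_day):
    """Average ingest rate in MB/s, using decimal units (1 TB = 1,000,000 MB)."""
    return tb_per_day * 1_000_000 / SECONDS_PER_DAY


def average_gb_per_service(tb_per_day, services):
    """Average daily volume per service in GB, assuming an even split."""
    return tb_per_day * 1000 / services


rate = average_rate_mb_per_s(3)                 # 3 TB/day, from the text
per_service = average_gb_per_service(3, 100)    # ~100 services, from the text

print(f"Average sustained rate: {rate:.1f} MB/s")
print(f"Average per service: {per_service:.0f} GB/day")
```

Under these assumptions the pipeline sustains an average of about 35 MB/s, or around 30 GB per service per day. Peak rates during traffic spikes would be higher than this average.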
Logging with confidence
The move to Logz.io showed immediate results. Kibana queries that used to be sluggish, or even crash Elasticsearch, now run smoothly, making troubleshooting a much faster process.
With guaranteed scalability, growth in log volume and data spikes are no longer a concern. The ability to simply log what needs to be logged to gain visibility into the system, without worrying about the underlying logging infrastructure, has allowed the team to move forward with other pressing projects and key initiatives. One such project is a migration to Kubernetes, which requires careful implementation and a reliable log management system to identify issues early on.
The SRE team at ZipRecruiter has saved valuable engineering time that would otherwise have been spent on troubleshooting and resolving ELK-related production issues. While the team’s ability to resolve these issues was never in doubt, the resources consumed in doing so were a concern.
Alon Becker sums it up: “We are a team that likes to be nimble and we realized that building an ELK Stack to scale was not our main responsibility. The overall service Logz.io provides has had a major impact on our work, allowing us to focus more on what matters – building, deploying and monitoring our product.”