How HotSchedules Reduced On-Call Alerts by 75%

About HotSchedules

HotSchedules provides the leading cloud-based intelligent operating platform, as well as supporting solutions and services for the restaurant, retail and hospitality industries. Designed for independents, multi-unit franchise operators and international enterprise brands, HotSchedules serves over 2.8 million users across 159,000 locations in 56 countries, helping them to control costs, maintain compliance, improve visibility, grow top line revenue, and drive operational consistency.

Fighting for time

It was a battle for time. Tasked with developing HotSchedules’ rapidly growing Clarifi™ suite, HotSchedules engineers found themselves wasting valuable resources on what was becoming an increasing time-drain — maintaining and operating the logging infrastructure for their applications.

The logging system that HotSchedules put in place a few years back was based on Graylog and Elasticsearch deployed on AWS. This system worked great for a while, but with the substantial growth in the number of new services and the data they generated, issues such as capacity bursts and slow search performance began to surface.

The team found themselves busy worrying about the reliability and scalability of the application’s logging infrastructure instead of developing the application itself. As Denise Stockman, Senior Director of Infrastructure at HotSchedules, puts it: “As more services were added, we found ourselves investing more time on maintaining our logging infrastructure instead of product features. Maintaining the logging service became a distraction instead of a supportive service.”

When on-call alerts on over-capacity logging infrastructure mounted, HotSchedules’ engineers realized the time had come to change the way they thought about their logging system and begin looking for an alternative solution.

Benchmarking logging requirements

HotSchedules had now managed to identify precisely what was not working for them in their current logging system and formulated a list of requirements that could be tested across different logging systems and solutions.

Integration with existing technologies, community support, flexibility and avoiding vendor lock-in placed open source technology at the top of the list of requirements. The team was already using fluentd as part of the logging pipeline, so easy integration with this log aggregator was a must. They also wanted to integrate logging data into their existing Grafana deployment to present all information in a single dashboard.

Another key requirement was API support. The engineering team at HotSchedules is an API-first shop, and to facilitate automation, any solution HotSchedules ended up with needed to support API integration.

A cultural fit was just as important. The team was looking for a solution that was customer-driven, had robust support options, and that allowed building a relationship that was based on an open partnership.

To allow for data segregation, the team also required the ability to separate different development environments. Because the team is sensitive to data security, compliance was also crucial.

After benchmarking these requirements across various alternative solutions, Logz.io was the only one found to fit the bill.

Cross-company adoption

Making the move to Logz.io was straightforward.

Clarifi™ is based on a microservices architecture. The various services are written primarily in Java and node.js, generating JSON-formatted log messages aggregated with fluentd. A Logz.io plugin for fluentd is used to transmit the data into the service. To make the transition as smooth as possible, Logz.io helped in recreating Graylog searches in Logz.io and helped to champion training sessions for internal adoption.
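To illustrate what "JSON-formatted log messages" can look like at the service level, here is a minimal sketch. It is written in Python purely for illustration (the Clarifi™ services themselves are Java and node.js), and the field names, the JsonFormatter class and the make_logger helper are assumptions, not the format HotSchedules actually uses. Each log call produces one JSON object per line, which is the kind of output an aggregator such as fluentd can parse and forward.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object (hypothetical fields)."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def make_logger(name, stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()       # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False      # keep output out of the root logger
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = make_logger("orders", buffer)
    log.info("order %s created", 17)
    print(buffer.getvalue(), end="")
```

In a real deployment the stream would typically be stdout or a file that the log aggregator tails, rather than an in-memory buffer.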

In addition to applicative logging, HotSchedules also ships system data from a number of additional components in the architecture, including RabbitMQ and MongoDB logs. In total, approximately 200GB worth of log data is shipped daily to Logz.io. This data is used for troubleshooting and monitoring Clarifi™ by multiple teams at HotSchedules, including development, QA, product and support.

Automatic logging

From executing queries to creating Kibana objects and managing users, the engineers at HotSchedules use the Logz.io API to automate tasks that would otherwise have required manual handling.

Denise provides one such example: “As part of the migration to Logz.io, we needed to create a number of accounts for our various users. To quickly get them set up, we used Logz.io’s User Management API to quickly create all of the accounts. We exported a list of users to a .csv file from our internal LDAP service and used a little one-liner (example provided below) to get the accounts created.”
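The original one-liner is not reproduced in this text. As a rough illustration of the same idea, the Python sketch below reads a CSV export and builds one user-creation request per row. The CSV column names (email, full_name), the endpoint URL, the X-API-TOKEN header and the JSON field names are all assumptions made for this example and should be checked against the current Logz.io API documentation before use.

```python
import csv
import io
import json
import urllib.request

# Assumed endpoint for illustration only; verify against the Logz.io API docs.
API_URL = "https://api.logz.io/v1/user-management"


def build_user_requests(csv_text, account_id):
    """Turn CSV rows (assumed columns: email, full_name) into JSON request bodies."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bodies = []
    for row in reader:
        bodies.append(json.dumps({
            "username": row["email"].strip(),      # assumed field name
            "fullName": row["full_name"].strip(),  # assumed field name
            "accountID": account_id,               # assumed field name
        }))
    return bodies


def create_user(body, api_token):
    """POST one request body to the assumed endpoint; returns the HTTP status code."""
    request = urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-TOKEN": api_token},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    sample = "email,full_name\nann@example.com,Ann Lee\n"
    for body in build_user_requests(sample, account_id=12345):
        print(body)  # inspect the payloads before sending anything
```

Building the payloads separately from sending them makes it possible to review the full list of accounts before any request is made.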

Increased visibility

One of the points of friction with the previous logging service was the disconnect between logging and metrics, which lived in separate data islands. HotSchedules has a heavy investment in application instrumentation using Grafana and a number of data backends such as Graphite and InfluxDB.

To reduce the overhead of operational management of their applications, HotSchedules wanted to be able to build a dashboard that combined logging and time series data.

Using Logz.io’s integration with Grafana, HotSchedules is now able to present curated logging events within their existing Grafana dashboards, helping application owners surface issues more quickly with less context switching and overhead.

A productivity boost

Ultimately, the key motivation for making a change was productivity. HotSchedules engineers were desperately looking for a way to free up time so they could focus on what mattered most to them and the organization: developing and delivering new product functionality.

Adopting Logz.io has enabled HotSchedules to do just that by providing a logging solution that is both scalable and reliable, and that complements the development methodologies in practice. The end result is a dramatic reduction in the time and resources spent on putting out fires related to the logging infrastructure. Based on the past 3 months of on-call data, HotSchedules estimates a 75% reduction in on-call alerts as a result of adopting Logz.io as the primary logging solution for Clarifi™.

Denise sums it up: “We now have much more trust in our logging capabilities. We no longer have to worry about waking up in the middle of the night due to oversubscribed infrastructure, and can focus on what we do best: product development.”
