Elasticsearch has long been the prominent solution for log management and analytics. Cloud-native and microservices architectures, together with the surge in workload volumes and diversity, have surfaced some challenges for web-scale enterprises such as Slack and Twitter.
My podcast guest Suman Karumuri, a Sr. Staff software engineer at Slack, has built a career on solving this problem. In my chat with Suman, he discusses publicly for the first time a new project from his team at Slack: KalDB. KalDB is structured similarly to Suman’s earlier project at Twitter, LogLens, but incorporates lessons learned from that project as well as from the known issues with the popular solution Elasticsearch.
Rediscovering Apache Lucene
At its roots, Slack’s KalDB, like Elasticsearch and Solr, is a Lucene-based indexing system. Apache Lucene is an open-source Java library for indexing and searching textual data. According to Suman, Lucene is a more efficient architecture than most give it credit for. In his words, “Most systems assume that Elasticsearch is expensive [to operate] because Lucene is expensive, that’s actually not quite true. Lucene actually is a very good storage engine and the architecture we had [at Twitter] for LogLens works fairly well even today.” In my opinion as well, the base Apache Lucene library is undervalued. Many assume that the shortcomings of Elasticsearch are indicative of problems with the Lucene library, but this is not the case. Both Apache Solr and Elasticsearch are based on the Lucene library, and Solr certainly provides features that Elasticsearch lacks. I also see organizations building directly on top of Lucene to gain better performance and reduce cost, such as Yelp’s engineering team, which built Nrtsearch to replace Elasticsearch for its needs.
Elasticsearch challenges and Slack’s KalDB approach
In our conversation, Suman is eager to point out that KalDB is designed to scale with logging volume. He gives the common case of log storms: spikes in logging volume like you might see during peak times (think of the first day of the year, when everybody logs back in) or during a major event such as an outage or an infrastructure migration. As he says, that is a ton of log messages that your system has no control over. To solve this problem, Slack built KalDB to scale automatically with the logging volume. He notes that this is in contrast to pre-provisioned systems such as Elasticsearch.
Being cloud-native is another central criterion. Suman mentions that his team found running Elasticsearch on Kubernetes quite painful. Slack’s engineers designed KalDB to be cloud-native and to run on Kubernetes to make operations management easy.
Other benefits of KalDB include the ability to handle field conflicts automatically by attempting to resolve them on read, as opposed to on write. In Suman’s words, “it’s easier to change your query to read the logs than it is to get that raw data.”
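To make the read-versus-write distinction concrete, here is a minimal sketch of resolving a field conflict at query time. It is an illustration only and not KalDB’s actual code or API: the record layout and the helper names `read_field` and `query_equals` are assumptions made up for this example. The idea is that raw records are stored exactly as they arrive, and the query decides how to interpret a field whose type differs between services.

```python
from typing import Any, Callable, Dict, List

# Hypothetical illustration only: not KalDB's actual code or API.
# Raw log records are stored as received, even though "status" arrives
# as an int from one service and as a string from others.
RAW_LOGS: List[Dict[str, Any]] = [
    {"service": "api", "status": 500},
    {"service": "web", "status": "500"},
    {"service": "jobs", "status": "timeout"},
]


def read_field(record: Dict[str, Any], field: str, cast: Callable[[Any], Any]) -> Any:
    """Return the field coerced to the type the query asks for, or None."""
    try:
        return cast(record[field])
    except (KeyError, TypeError, ValueError):
        # The value cannot be read as the requested type; the query skips it,
        # but the raw record is still stored and readable with another cast.
        return None


def query_equals(records: List[Dict[str, Any]], field: str, value: Any,
                 cast: Callable[[Any], Any]) -> List[Dict[str, Any]]:
    """Match records whose field equals `value` after read-time coercion."""
    return [r for r in records if read_field(r, field, cast) == value]


numeric_matches = query_equals(RAW_LOGS, "status", 500, int)
text_matches = query_equals(RAW_LOGS, "status", "timeout", str)
print([r["service"] for r in numeric_matches])
print([r["service"] for r in text_matches])
```

With a write-time schema, the conflicting records would typically be rejected or dropped at ingestion. In this sketch nothing is lost: changing the cast in the query is enough to reach the records that did not fit the first interpretation.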
Multitenancy is also highly important for Slack, both to make the best use of resources and to isolate workloads so that the load on one tenant does not impact the performance of another. In Elasticsearch today, you can have multiple indexes on a single node, but without isolation. According to Suman, KalDB offers multitenancy and isolation out of the box.
Resilience, and in particular avoiding a single point of failure (SPOF), was another criterion. Suman points out that if a single Elasticsearch node goes down, it can delay all data ingestion during the downtime. In KalDB, he says, only a small subset of that data will be delayed. This feature is particularly beneficial in a high log volume incident, such as a log storm. Clearly, when the system is already behind and a node goes down, the infrastructure falls even further behind. Failure cases like this show why Suman’s team felt the need to redesign this portion of their observability infrastructure from the ground up, instead of modifying existing systems or taking them off the shelf.
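One way to picture why a node failure might delay only a subset of the data is partitioned ingestion. The sketch below is an assumption for illustration, not a description of how KalDB is built: the partition count, the hashing scheme, and the names `partition_for` and `delayed_fraction` are all hypothetical. Each message is routed to one partition by key, so losing the node behind one partition leaves the other partitions ingesting normally.

```python
import zlib
from typing import List

# Hypothetical illustration only: not KalDB's actual design or code.
NUM_PARTITIONS = 8  # assumed number of independent ingestion partitions


def partition_for(message_key: str) -> int:
    """Route a message to a partition using a stable hash of its key."""
    return zlib.crc32(message_key.encode("utf-8")) % NUM_PARTITIONS


def delayed_fraction(keys: List[str], failed_partition: int) -> float:
    """Fraction of messages that land on the failed partition."""
    delayed = sum(1 for key in keys if partition_for(key) == failed_partition)
    return delayed / len(keys)


# Simulated message keys, for example one per emitting host.
keys = [f"host-{i}" for i in range(10_000)]
fraction = delayed_fraction(keys, failed_partition=3)
print(f"{fraction:.1%} of messages delayed while partition 3 is down")
```

If the hash spreads keys roughly evenly, the delayed share is close to one over the number of partitions, and the remaining messages are unaffected by the failure.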
Unifying Events, Logs, and Traces (ELT) data
The GitHub page for KalDB describes the system as an attempt to “unify events, logs, and traces (ELT) data under a single system.” The benefits of a unified system are significant, they say, including reduced infrastructure expense, simplified and non-redundant data, and more powerful and expressive queries.
In our chat, Suman states that one of his goals is “to make traces easy to consume [to] increase the value of traces.” One way that KalDB achieves this is by converting spans to SpanEvent data, a standardized format that simplifies data ingestion and queries. This data format, Suman says, has enabled his team at Slack to have end-to-end tracing in major processes, and has improved the tracing capabilities for mobile frameworks as well. According to Suman, the tracing capabilities at Slack are now such that tracing is used not only to monitor software but also to model business processes.
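As a rough sketch of what converting a span into one standardized event record could look like, consider the example below. The field set of `SpanEvent` here, the input span layout, and the function `to_span_event` are assumptions made for illustration; the episode does not spell out the real SpanEvent schema, so treat this as one plausible shape rather than Slack’s definition.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SpanEvent:
    # Hypothetical field set, chosen for illustration only.
    id: str
    trace_id: str
    parent_id: Optional[str]
    name: str
    timestamp_us: int
    duration_us: int
    tags: Dict[str, str] = field(default_factory=dict)


def to_span_event(span: Dict[str, Any]) -> SpanEvent:
    """Convert a tracer-specific span dict into the common SpanEvent shape."""
    start_us = int(span["start_ms"] * 1000)
    end_us = int(span["end_ms"] * 1000)
    return SpanEvent(
        id=span["span_id"],
        trace_id=span["trace_id"],
        parent_id=span.get("parent_span_id"),  # None for a root span
        name=span["operation"],
        timestamp_us=start_us,
        duration_us=end_us - start_us,
        # Flatten attributes to strings so every event has the same tag type.
        tags={key: str(value) for key, value in span.get("attributes", {}).items()},
    )


raw_span = {
    "span_id": "b1",
    "trace_id": "t1",
    "parent_span_id": "a0",
    "operation": "http.request",
    "start_ms": 1000,
    "end_ms": 1250,
    "attributes": {"status": 200, "route": "/api/chat"},
}
event = to_span_event(raw_span)
print(event)
```

Once every producer emits the same flat record, a trace is simply the set of events sharing a `trace_id`, which is what makes the data easy to ingest and query alongside ordinary logs.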
The open-source path
KalDB shares some roots with the successful OpenSearch project, which started as a fork of Elasticsearch and Kibana. Both offer features that have long been lacking in Elasticsearch, and successful development in these projects has made those sought-after features available. KalDB, Suman says, “can definitely be open source.” He mentions that Twitter’s LogLens was never transitioned to open source because he thought it was an “obvious idea [and that] somebody would build it better than I could.” This time, he wants the project properly open-sourced. At the moment, the source is available on GitHub, although the documentation is a little sparse. Nonetheless, it is simple to play with as a Maven project.
Want to learn more? Check out the OpenObservability Talks episode: Building web-scale observability at Slack, Pinterest & Twitter on: