Metrics storage engines must be specially engineered to accommodate the quirks of metrics time-series data. Prometheus is probably the most popular metrics storage engine today, powering numerous services including our own Logz.io Infrastructure Monitoring. But Prometheus was not enough for Slack given their web-scale operation. They set out to design a new storage engine that can yield 10x more write throughput, and 3x more read throughput than Prometheus!
In February 2022 Suman Karumuri, Sr. Staff software engineer at Slack, published a research paper together with fellow researchers from Brown University, CMU, Intel Labs, and MIT:
Mach: A Pluggable Metrics Storage Engine for the Age of Observability
Why did Slack need to invent the MACH storage engine?
In their paper, the team notes that metrics data has characteristics that can be difficult to handle if approached in a naive manner. These include the consideration of the patterns of data production in the time dimension (once per second, event-driven, etc.) as well as the potentially massive space dimension (i.e. data emission from a billion active sources). A storage engine designed with metrics storage in mind is a necessity. Today, Prometheus is a very popular solution, but Suman believes that there are weaknesses that need to be addressed.
Suman doesn’t beat around the bush when he discusses the shortcomings of the popular Prometheus platform. For example, he notes that automation combined with a lack of data consistency guarantees can easily lead to incorrect behavior. He also points out that the scaling behavior is not ideal, and that there are inefficiencies with “how many metrics they ingest, how many times they ingest […] I think there’s a lot of room for improvement there.”
It’s because of these shortcomings that his team designed a new storage engine, MACH, that he claims is “better than Prometheus.” That’s a bold claim, but he does appear to be able to back it up – their paper claims 10x write throughput and 3x read throughput over Prometheus. Their findings also seem to indicate that their loosely coordinated threading architecture scales very well with high thread counts. This is of course a boon now that 32, 64, or 128 core CPUs are not uncommon.
How does MACH achieve this performance?
10x write throughput and 3x read throughput over Prometheus! That’s an astonishing performance boost. Prometheus is by no means a poorly written engine, so I asked Suman to summarize how his team achieved such outstanding improvements. One of the first features he is eager to point out is support for tiered storage and a staged ingestion pipeline right out of the box. “What happens is you take the metrics, write them to disk, uncompressed. And then [you] compress them after you reach a few data points and [can] look back,” Suman says. Contrary to Prometheus, which has a fixed compression mechanism, this staged ingestion mechanism allows you to “pick a compression mechanism that is more suitable to the data.” Their paper gives the example of CPU core temperature readings which fall in a fairly narrow range, usually 25 – 50 C. The authors point out that this data could benefit from a different compression algorithm than is used for CPU utilization data.
In an extreme case, Suman points out, an incorrectly chosen compression mechanism can cause Prometheus to “explode.” Another performance boost comes from this staged compression, in that the data ingestion is only limited by the disk speed. This is again in contrast to an engine like Prometheus, which their paper describes as using “eager compression,” which can add overhead to the write sequence.
In the case of a Prometheus node going offline and needing to restart, Suman also notes that Prometheus then has to reprocess the backlog of data (the write-ahead log, or WAL), which can take “minutes, tens of minutes.” This can totally kill a 99th percentile (P99) performance metric for write throughput. In the case of MACH architecture, however, the data is ingested directly to the local disk, which enables a far quicker data recovery time in case of a node restart. These fast recovery times are not only valuable for performance but also vastly improve operational efficiency of the system.
Another major strong point that MACH offers is the ability to ingest multi-value time series for faster parallel ingestion. This is compared to Prometheus that can only ingest a single value time series at a time.
I know many of us are eager to dig into an exciting new technology such as MACH, but at the moment the code is not available. In our conversation, Suman promised a proof of concept implementation that will be open-sourced in the future. In the meantime, their research paper published in CIDR is sure to provide interesting reading and insights about this new technology.
Want to learn more? Check out the OpenObservability Talks episode: Building web-scale observability at Slack, Pinterest & Twitter on: